[jira] [Assigned] (SPARK-39904) Rename inferDate to preferDate and fix an issue when inferring schema

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39904:


Assignee: Apache Spark

> Rename inferDate to preferDate and fix an issue when inferring schema
> -
>
> Key: SPARK-39904
> URL: https://issues.apache.org/jira/browse/SPARK-39904
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Apache Spark
>Priority: Major
>
> Follow-up for https://issues.apache.org/jira/browse/SPARK-39469.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39904) Rename inferDate to preferDate and fix an issue when inferring schema

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39904:


Assignee: (was: Apache Spark)

> Rename inferDate to preferDate and fix an issue when inferring schema
> -
>
> Key: SPARK-39904
> URL: https://issues.apache.org/jira/browse/SPARK-39904
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Priority: Major
>
> Follow-up for https://issues.apache.org/jira/browse/SPARK-39469.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39904) Rename inferDate to preferDate and fix an issue when inferring schema

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572240#comment-17572240
 ] 

Apache Spark commented on SPARK-39904:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/37327

> Rename inferDate to preferDate and fix an issue when inferring schema
> -
>
> Key: SPARK-39904
> URL: https://issues.apache.org/jira/browse/SPARK-39904
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Priority: Major
>
> Follow-up for https://issues.apache.org/jira/browse/SPARK-39469.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39844) Restrict adding DEFAULT columns for existing tables to allowlist of supported data source types

2022-07-27 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-39844:
--

Assignee: Daniel

> Restrict adding DEFAULT columns for existing tables to allowlist of supported 
> data source types
> ---
>
> Key: SPARK-39844
> URL: https://issues.apache.org/jira/browse/SPARK-39844
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39844) Restrict adding DEFAULT columns for existing tables to allowlist of supported data source types

2022-07-27 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-39844.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37256
[https://github.com/apache/spark/pull/37256]

> Restrict adding DEFAULT columns for existing tables to allowlist of supported 
> data source types
> ---
>
> Key: SPARK-39844
> URL: https://issues.apache.org/jira/browse/SPARK-39844
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39905) Remove checkErrorClass()

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39905:


Assignee: Apache Spark  (was: Max Gekk)

> Remove checkErrorClass()
> 
>
> Key: SPARK-39905
> URL: https://issues.apache.org/jira/browse/SPARK-39905
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Replace all invocations of checkErrorClass() with checkError() and remove 
> checkErrorClass().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39905) Remove checkErrorClass()

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39905:


Assignee: Max Gekk  (was: Apache Spark)

> Remove checkErrorClass()
> 
>
> Key: SPARK-39905
> URL: https://issues.apache.org/jira/browse/SPARK-39905
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Replace all invocations of checkErrorClass() with checkError() and remove 
> checkErrorClass().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39905) Remove checkErrorClass()

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572238#comment-17572238
 ] 

Apache Spark commented on SPARK-39905:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37322

> Remove checkErrorClass()
> 
>
> Key: SPARK-39905
> URL: https://issues.apache.org/jira/browse/SPARK-39905
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Replace all invocations of checkErrorClass() with checkError() and remove 
> checkErrorClass().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39905) Remove checkErrorClass()

2022-07-27 Thread Max Gekk (Jira)
Max Gekk created SPARK-39905:


 Summary: Remove checkErrorClass()
 Key: SPARK-39905
 URL: https://issues.apache.org/jira/browse/SPARK-39905
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk


Replace all invocations of checkErrorClass() with checkError() and remove 
checkErrorClass().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39904) Rename inferDate to preferDate and fix an issue when inferring schema

2022-07-27 Thread Ivan Sadikov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Sadikov updated SPARK-39904:
-
Description: Follow-up for 
https://issues.apache.org/jira/browse/SPARK-39469.

> Rename inferDate to preferDate and fix an issue when inferring schema
> -
>
> Key: SPARK-39904
> URL: https://issues.apache.org/jira/browse/SPARK-39904
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Priority: Major
>
> Follow-up for https://issues.apache.org/jira/browse/SPARK-39469.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39904) Rename inferDate to preferDate and fix an issue when inferring schema

2022-07-27 Thread Ivan Sadikov (Jira)
Ivan Sadikov created SPARK-39904:


 Summary: Rename inferDate to preferDate and fix an issue when 
inferring schema
 Key: SPARK-39904
 URL: https://issues.apache.org/jira/browse/SPARK-39904
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Ivan Sadikov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39899) Incorrect passing of message parameters in InvalidUDFClassException

2022-07-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-39899.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37323
[https://github.com/apache/spark/pull/37323]

> Incorrect passing of message parameters in InvalidUDFClassException
> ---
>
> Key: SPARK-39899
> URL: https://issues.apache.org/jira/browse/SPARK-39899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> In fact, messageParameters is not passed to AnalysisException; it is used only 
> to form the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39889) Use different error classes for numeric/interval divided by 0

2022-07-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-39889.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37313
[https://github.com/apache/spark/pull/37313]

> Use different error classes for numeric/interval divided by 0
> -
>
> Key: SPARK-39889
> URL: https://issues.apache.org/jira/browse/SPARK-39889
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, when numbers are divided by 0 under ANSI mode, the error message 
> is like
> {quote}[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate 
> divisor being 0 and return NULL instead. If necessary set "ansi_mode" to 
> "false" (except for ANSI interval type) to bypass this error.{quote}
> The "(except for ANSI interval type)" part is confusing.  We should remove it 
> and have a new error class "INTERVAL_DIVIDED_BY_ZERO"
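
For illustration only (not from the ticket), a minimal sketch of the two behaviours the error message distinguishes, assuming a local session with ANSI mode enabled:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.ansi.enabled", "true")
  .getOrCreate()

// Numeric division by zero fails under ANSI mode with DIVIDE_BY_ZERO:
// spark.sql("SELECT 1 / 0").show()

// try_divide tolerates a zero divisor and returns NULL instead.
spark.sql("SELECT try_divide(1, 0)").show()

// Interval division by zero is the case this ticket wants reported under a
// separate error class such as INTERVAL_DIVIDED_BY_ZERO:
// spark.sql("SELECT INTERVAL '2' YEAR / 0").show()
{code}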



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2022-07-27 Thread wenweijian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572212#comment-17572212
 ] 

wenweijian commented on SPARK-34788:


I got this FileNotFoundException when the 
HDFS (/tmp/logs/root/bucket-logs-tfile) disk was full.

When I deleted some files in that directory, the exception disappeared. 

 

logs:
{code:java}
org.apache.spark.shuffle.FetchFailedException: Error in reading 
FileSegmentManagedBuffer[file=/home/install/hadoop/data2/hadoop-3.3.0/nm-local-dir/usercache/root/appcache/application_1658221318180_0002/blockmgr-001c6d1a-bda3-4f8e-8998-42e3a8fbaaa9/0d/shuffle_0_7622_0.data,offset=57062119,length=26794]
    at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:770)
    at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:649)
    at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
    at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
    at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200)
    at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:128)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Error in reading 
FileSegmentManagedBuffer[file=/home/install/hadoop/data2/hadoop-3.3.0/nm-local-dir/usercache/root/appcache/application_1658221318180_0002/blockmgr-001c6d1a-bda3-4f8e-8998-42e3a8fbaaa9/0d/shuffle_0_7622_0.data,offset=57062119,length=26794]
    at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:112)
    at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:637)
    ... 23 more
Caused by: java.io.FileNotFoundException: 
/home/install/hadoop/data2/hadoop-3.3.0/nm-local-dir/usercache/root/appcache/application_1658221318180_0002/blockmgr-001c6d1a-bda3-4f8e-8998-42e3a8fbaaa9/0d/shuffle_0_7622_0.data
 (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:101)
    ... 24 more {code}

> Spark throws FileNotFoundException instead of IOException when disk is full
> ---
>
> Key: SPARK-34788
> URL: https://issues.apache.org/jira/browse/SPARK-34788
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
> When the disk is full, Spark throws a FileNotFoundException instead of an 
> IOException with a hint about the full disk. It's quite a confusing error for users:
> {code:java}
> 9/03/26 09:03:45 ERROR ShuffleBlockFetcherIterator: Failed to create input 
> stream from local block
> java.io.IOException: Error in reading 
> FileSegmentManagedBuffer{file=/local_disk0/spark-c2f26f02-2572-4764-815a-cbba65ddb315/executor-b4b76a4c-788c-4cb6-b904-664a883be1aa/blockmgr-36804371-24fe-4131-a3dc-00b7f98f3a3e/11/shuffle_113_1029_0.data,
>  offset=110254956, length=1875458}
>   at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:111)
>   at 
> 

[jira] [Resolved] (SPARK-39890) Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering

2022-07-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39890.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37318
[https://github.com/apache/spark/pull/37318]

> Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering
> ---
>
> Key: SPARK-39890
> URL: https://issues.apache.org/jira/browse/SPARK-39890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Minor
> Fix For: 3.4.0
>
>
> AliasAwareOutputOrdering can save a sort if the project inside 
> TakeOrderedAndProjectExec has an alias for the sort order.
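
An illustrative sketch of the kind of query this helps (table and column names are made up): the inner ORDER BY ... LIMIT plans a TakeOrderedAndProjectExec whose project aliases the sort column, so the outer ORDER BY on the alias should no longer need its own Sort once the node reports alias-aware output ordering:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.range(100).toDF("c1").createOrReplaceTempView("t")

val q = spark.sql(
  """SELECT a
    |FROM (SELECT c1 AS a FROM t ORDER BY c1 LIMIT 10)
    |ORDER BY a""".stripMargin)

// With the change, the plan should not contain an extra Sort above
// TakeOrderedAndProject for the outer ORDER BY a.
q.explain()
{code}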



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39890) Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering

2022-07-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39890:
---

Assignee: XiDuo You

> Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering
> ---
>
> Key: SPARK-39890
> URL: https://issues.apache.org/jira/browse/SPARK-39890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
> Fix For: 3.4.0
>
>
> AliasAwareOutputOrdering can save a sort if the project inside 
> TakeOrderedAndProjectExec has an alias for the sort order.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39903) Reenable TPC-DS q72 in GitHub Actions

2022-07-27 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-39903:


 Summary: Reenable TPC-DS q72 in GitHub Actions
 Key: SPARK-39903
 URL: https://issues.apache.org/jira/browse/SPARK-39903
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.3.0, 3.4.0
Reporter: Hyukjin Kwon


https://github.com/apache/spark/pull/37289 disabled TPC-DS q72 in GitHub 
Actions. We should reenable this to recover the test coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39900) Issue with querying dataframe produced by 'binaryFile' format using 'not' operator

2022-07-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572201#comment-17572201
 ] 

Hyukjin Kwon commented on SPARK-39900:
--

Please go ahead with a PR, [~Zing].

> Issue with querying dataframe produced by 'binaryFile' format using 'not' 
> operator
> --
>
> Key: SPARK-39900
> URL: https://issues.apache.org/jira/browse/SPARK-39900
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Benoit Roy
>Priority: Minor
>
> When creating a dataframe using the binaryFile format, I am encountering weird 
> results when filtering/querying with the 'not' operator.
>  
> Here's a repo that will help describe and reproduce the issue.
> [https://github.com/cccs-br/spark-binaryfile-issue]
> {code:java}
> g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
>  
> Here's a very simple test case that illustrates what's going on:
> [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
> TLDR;
> {code:java}
>test("binary file dataframe") {
> // load files in directly into df using 'binaryFile' format.
>     //     
> // - src/test/resources/files/
>     //  - test1.csv
>     //  - test2.json
>     //  - test3.txt
> val df = spark
>   .read
>   .format("binaryFile")
>   .load("src/test/resources/files")
> df.createOrReplaceTempView("files")
> // This works as expected.
> val like_count = spark.sql("select * from files where path like 
> '%.csv'").count()
> assert(like_count === 1)
> // This does not work as expected.
> val not_like_count = spark.sql("select * from files where path not like 
> '%.csv'").count()
> assert(not_like_count === 2)
> // This used to work in 3.2.1
> // df.filter(col("path").endsWith(".csv") === false).show()
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572193#comment-17572193
 ] 

Sumeet commented on SPARK-39902:


An example of this change can be seen while viewing the Iceberg scans on 
SparkUI.
h2. Before this change:

!Screen Shot 2022-07-27 at 6.39.48 PM.png|width=415,height=211!

 
h2. After this change:

!Screen Shot 2022-07-27 at 6.38.56 PM.png|width=430,height=216!

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png, Screen Shot 2022-07-27 at 6.38.56 PM.png, 
> Screen Shot 2022-07-27 at 6.39.48 PM.png
>
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
> !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!
>  
> DSv2
> !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Attachment: Screen Shot 2022-07-27 at 6.39.48 PM.png

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png, Screen Shot 2022-07-27 at 6.38.56 PM.png, 
> Screen Shot 2022-07-27 at 6.39.48 PM.png
>
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
> !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!
>  
> DSv2
> !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Attachment: Screen Shot 2022-07-27 at 6.38.56 PM.png

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png, Screen Shot 2022-07-27 at 6.38.56 PM.png, 
> Screen Shot 2022-07-27 at 6.39.48 PM.png
>
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
> !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!
>  
> DSv2
> !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572192#comment-17572192
 ] 

Apache Spark commented on SPARK-39902:
--

User 'sumeetgajjar' has created a pull request for this issue:
https://github.com/apache/spark/pull/37325

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png
>
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
> !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!
>  
> DSv2
> !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39902:


Assignee: Apache Spark

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png
>
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
> !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!
>  
> DSv2
> !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39902:


Assignee: (was: Apache Spark)

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png
>
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
> !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!
>  
> DSv2
> !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Description: 
Hi,

For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
as opposed to "Scan ".

Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
invoke to set the node name the plan. This nodeName will be eventually used by 
"SparkPlanGraphNode" to display it in the header of the UI node.

 

DSv1

!Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!

 

DSv2

!Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!

  was:
Hi,

For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
as opposed to "Scan ".

Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
invoke to set the node name the plan. This nodeName will be eventually used by 
"SparkPlanGraphNode" to display it in the header of the UI node.

 

DSv1

 


> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png
>
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
> !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!
>  
> DSv2
> !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Attachment: Screen Shot 2022-07-27 at 6.00.50 PM.png

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png
>
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Attachment: Screen Shot 2022-07-27 at 6.00.27 PM.png

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png
>
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Attachment: (was: Screen Shot 2022-07-27 at 6.00.27 PM.png)

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Attachment: Screen Shot 2022-07-27 at 6.00.27 PM.png

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Description: 
Hi,

For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
as opposed to "Scan ".

Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
invoke to set the node name the plan. This nodeName will be eventually used by 
"SparkPlanGraphNode" to display it in the header of the UI node.

 

DSv1

 

  was:
Hi,

For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
as opposed to "Scan ".

Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
invoke to set the node name the plan. This nodeName will be eventually used by 
"SparkPlanGraphNode" to display it in the header of the UI node.

 


> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572186#comment-17572186
 ] 

Sumeet commented on SPARK-39902:


I'm working on it and will publish a patch soon.

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)
Sumeet created SPARK-39902:
--

 Summary: Add Scan details to spark plan scan node in SparkUI
 Key: SPARK-39902
 URL: https://issues.apache.org/jira/browse/SPARK-39902
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Web UI
Affects Versions: 3.3.1
Reporter: Sumeet


Hi,

For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
as opposed to "Scan ".

Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
invoke to set the node name the plan. This nodeName will be eventually used by 
"SparkPlanGraphNode" to display it in the header of the UI node.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38496) Improve the test coverage for pyspark/sql module

2022-07-27 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572180#comment-17572180
 ] 

Haejoon Lee commented on SPARK-38496:
-

I'm working on this

> Improve the test coverage for pyspark/sql module
> 
>
> Key: SPARK-38496
> URL: https://issues.apache.org/jira/browse/SPARK-38496
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, the sql module has 90% test coverage.
> We could improve the test coverage by adding the missing tests for the sql module.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38493) Improve the test coverage for pyspark/pandas module

2022-07-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38493.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37294
[https://github.com/apache/spark/pull/37294]

> Improve the test coverage for pyspark/pandas module
> ---
>
> Key: SPARK-38493
> URL: https://issues.apache.org/jira/browse/SPARK-38493
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, the pandas module (pandas API on Spark) has 94% test coverage.
> We could improve the test coverage by adding the missing tests for the pandas 
> module.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39839) Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check

2022-07-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-39839.
--
Fix Version/s: 3.3.1
   3.2.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37252
[https://github.com/apache/spark/pull/37252]

> Handle special case of null variable-length Decimal with non-zero 
> offsetAndSize in UnsafeRow structural integrity check
> ---
>
> Key: SPARK-39839
> URL: https://issues.apache.org/jira/browse/SPARK-39839
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 3.3.1, 3.2.3, 3.4.0
>
>
> The {{UnsafeRow}} structural integrity check in 
> {{UnsafeRowUtils.validateStructuralIntegrity}} is added in Spark 3.1.0. It’s 
> supposed to validate that a given {{UnsafeRow}} conforms to the format that 
> the {{UnsafeRowWriter}} would have produced.
> Currently the check expects that all fields marked as null also have their 
> fixed-length part set to all zeros. It needs to be 
> updated to handle a special case for variable-length {{Decimal}}s, where 
> the {{UnsafeRowWriter}} may mark a field as null but also leave the 
> fixed-length part of the field as {{OffsetAndSize(offset=current_offset, 
> size=0)}}. This may happen when the {{Decimal}} being written is either a 
> real {{null}} or has overflowed the specified precision.
> Logic in {{UnsafeRowWriter}}:
> in general:
> {code:scala}
>   public void setNullAt(int ordinal) {
>     BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null bit
>     write(ordinal, 0L);                                      // also zero out the fixed-length field
>   } {code}
> special case for {{DecimalType}}:
> {code:scala}
>       // Make sure Decimal object has the same scale as DecimalType.
>       // Note that we may pass in null Decimal object to set null for it.
>       if (input == null || !input.changePrecision(precision, scale)) {
>         BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null bit
>         // keep the offset for future update
>         setOffsetAndSize(ordinal, 0);                            // doesn't zero out the fixed-length field
>       } {code}
> The special case is introduced to allow all {{DecimalType}}s (including both 
> fixed-length and variable-length ones) to be mutable – thus the writer needs to leave 
> space for the variable-length field even if it’s currently null.
> Note that this special case in {{UnsafeRowWriter}} has been there since Spark 
> 1.6.0, whereas the integrity check was added in Spark 3.1.0. The check was 
> originally added for Structured Streaming’s checkpoint evolution validation, 
> so that a newer version of Spark can check whether or not an older checkpoint 
> file for Structured Streaming queries can be supported, and/or if the 
> contents of the checkpoint file is corrupted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38493) Improve the test coverage for pyspark/pandas module

2022-07-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38493:


Assignee: Haejoon Lee

> Improve the test coverage for pyspark/pandas module
> ---
>
> Key: SPARK-38493
> URL: https://issues.apache.org/jira/browse/SPARK-38493
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, the pandas module (pandas API on Spark) has 94% test coverage.
> We could improve the test coverage by adding the missing tests for the pandas 
> module.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39839) Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check

2022-07-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-39839:


Assignee: Kris Mok

> Handle special case of null variable-length Decimal with non-zero 
> offsetAndSize in UnsafeRow structural integrity check
> ---
>
> Key: SPARK-39839
> URL: https://issues.apache.org/jira/browse/SPARK-39839
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
>
> The {{UnsafeRow}} structural integrity check in 
> {{UnsafeRowUtils.validateStructuralIntegrity}} is added in Spark 3.1.0. It’s 
> supposed to validate that a given {{UnsafeRow}} conforms to the format that 
> the {{UnsafeRowWriter}} would have produced.
> Currently the check expects that all fields marked as null also have their 
> fixed-length part set to all zeros. It needs to be 
> updated to handle a special case for variable-length {{Decimal}}s, where 
> the {{UnsafeRowWriter}} may mark a field as null but also leave the 
> fixed-length part of the field as {{OffsetAndSize(offset=current_offset, 
> size=0)}}. This may happen when the {{Decimal}} being written is either a 
> real {{null}} or has overflowed the specified precision.
> Logic in {{UnsafeRowWriter}}:
> in general:
> {code:scala}
>   public void setNullAt(int ordinal) {
>     BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null bit
>     write(ordinal, 0L);                                      // also zero out the fixed-length field
>   } {code}
> special case for {{DecimalType}}:
> {code:scala}
>       // Make sure Decimal object has the same scale as DecimalType.
>       // Note that we may pass in null Decimal object to set null for it.
>       if (input == null || !input.changePrecision(precision, scale)) {
>         BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null bit
>         // keep the offset for future update
>         setOffsetAndSize(ordinal, 0);                            // doesn't zero out the fixed-length field
>       } {code}
> The special case is introduced to allow all {{DecimalType}}s (including both 
> fixed-length and variable-length ones) to be mutable – thus the writer needs to leave 
> space for the variable-length field even if it’s currently null.
> Note that this special case in {{UnsafeRowWriter}} has been there since Spark 
> 1.6.0, whereas the integrity check was added in Spark 3.1.0. The check was 
> originally added for Structured Streaming’s checkpoint evolution validation, 
> so that a newer version of Spark can check whether or not an older checkpoint 
> file for Structured Streaming queries can be supported, and/or if the 
> contents of the checkpoint file is corrupted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files

2022-07-27 Thread zhiming she (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572169#comment-17572169
 ] 

zhiming she commented on SPARK-39743:
-

[~hyukjin.kwon]

 

Can you mark this issue as `Resolved`?

> Unable to set zstd compression level while writing parquet files
> 
>
> Key: SPARK-39743
> URL: https://issues.apache.org/jira/browse/SPARK-39743
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yeachan Park
>Priority: Minor
>
> While writing zstd-compressed parquet files, the setting 
> `spark.io.compression.zstd.level` does not have any effect on 
> the zstd compression level.
> All files seem to be written with the default zstd compression level, and the 
> config option seems to be ignored.
> Using the zstd CLI tool, we confirmed that compressing the same file tested in 
> Spark at a higher compression level resulted in a smaller file.
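
A hedged reproduction sketch (paths and sizes are made up). Note that `spark.io.compression.zstd.level` is documented as tuning Spark's internal I/O codec (shuffle, broadcast, event logs) rather than Parquet output, which would be consistent with the behaviour reported above:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.io.compression.zstd.level", "9")  // internal codec level only
  .getOrCreate()

spark.range(1000000L).toDF("id")
  .write
  .option("compression", "zstd")     // selects the Parquet codec for the output
  .mode("overwrite")
  .parquet("/tmp/zstd-level-test")   // hypothetical output path
{code}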



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39900) Issue with querying dataframe produced by 'binaryFile' format using 'not' operator

2022-07-27 Thread shezm (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572166#comment-17572166
 ] 

shezm commented on SPARK-39900:
---

I can try to fix this issue.

> Issue with querying dataframe produced by 'binaryFile' format using 'not' 
> operator
> --
>
> Key: SPARK-39900
> URL: https://issues.apache.org/jira/browse/SPARK-39900
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Benoit Roy
>Priority: Minor
>
> When creating a dataframe using the binaryFile format, I am encountering weird 
> results when filtering/querying with the 'not' operator.
>  
> Here's a repo that will help describe and reproduce the issue.
> [https://github.com/cccs-br/spark-binaryfile-issue]
> {code:java}
> g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
>  
> Here's a very simple test case that illustrates what's going on:
> [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
> TLDR;
> {code:java}
>test("binary file dataframe") {
> // load files in directly into df using 'binaryFile' format.
>     //     
> // - src/test/resources/files/
>     //  - test1.csv
>     //  - test2.json
>     //  - test3.txt
> val df = spark
>   .read
>   .format("binaryFile")
>   .load("src/test/resources/files")
> df.createOrReplaceTempView("files")
> // This works as expected.
> val like_count = spark.sql("select * from files where path like 
> '%.csv'").count()
> assert(like_count === 1)
> // This does not work as expected.
> val not_like_count = spark.sql("select * from files where path not like 
> '%.csv'").count()
> assert(not_like_count === 2)
> // This used to work in 3.2.1
> // df.filter(col("path").endsWith(".csv") === false).show()
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39722) Make Dataset.showString() public

2022-07-27 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572165#comment-17572165
 ] 

Erik Krogen commented on SPARK-39722:
-

General +1 from me. We have some internal code that does exactly the 
{{Console.out}} redirection hack you described.

> Make Dataset.showString() public
> 
>
> Key: SPARK-39722
> URL: https://issues.apache.org/jira/browse/SPARK-39722
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.0
>Reporter: Jatin Sharma
>Priority: Trivial
>
> Currently, we have {{.show}} APIs on a Dataset, but they print directly to 
> stdout.
> But there are a lot of cases where we might need to get a String 
> representation of the show output. For example:
>  * We have a logging framework to which we need to push the representation of 
> a df
>  * We have to send the string over a REST call from the driver
>  * We want to send the string to stderr instead of stdout
> For such cases, one currently needs to resort to a hack: temporarily changing 
> Console.out, catching the representation in a ByteArrayOutputStream or 
> similar, then extracting the string from it.
> Strictly printing only to stdout seems like a limiting choice.
>  
> Solution:
> Expose APIs that return the String representation. We already have the 
> {{showString}} method internally.
>  
> We could mirror the current {{.show}} APIs with a corresponding 
> {{.showString}} (and rename the internal private method to something else 
> if required).
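
For reference, a minimal sketch of the Console.out redirection workaround described above (the helper name is illustrative):

{code:scala}
import java.io.ByteArrayOutputStream
import org.apache.spark.sql.DataFrame

// Capture what df.show() would print by temporarily redirecting Console.out.
def captureShow(df: DataFrame, numRows: Int = 20): String = {
  val buffer = new ByteArrayOutputStream()
  Console.withOut(buffer) {
    df.show(numRows)
  }
  buffer.toString("UTF-8")
}
{code}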



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39901) Reconsider design of ignoreCorruptFiles feature

2022-07-27 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-39901:
--

 Summary: Reconsider design of ignoreCorruptFiles feature
 Key: SPARK-39901
 URL: https://issues.apache.org/jira/browse/SPARK-39901
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen


I'm filing this ticket as a follow-up to the discussion at 
[https://github.com/apache/spark/pull/36775#issuecomment-1148136217] regarding 
the `ignoreCorruptFiles` feature: the current implementation is biased towards 
considering a broad range of IOExceptions to be corruption, but this is likely 
overly broad and might misidentify transient errors as corruption (causing 
non-corrupt data to be erroneously discarded).

SPARK-39389 fixes one instance of that problem, but we are still vulnerable to 
similar issues because of the overall design of this feature.

I think we should reconsider the design of this feature: maybe we should switch 
the default behavior so that only an explicit allowlist of known corruption 
exceptions can cause files to be skipped. This could be done with the involvement 
of other parts of the code, e.g. by rewrapping exceptions into a 
`CorruptFileException` so that higher layers can positively identify corruption.

Any changes to behavior here could potentially impact users' jobs, so we'd need 
to think carefully about when we want to make the change (in a 3.x release? 4.x?) 
and how we want to provide escape hatches (e.g. configs to revert to the old 
behavior). 
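
As a rough sketch of the allowlist idea (the {{CorruptFileException}} class and the 
helper below are hypothetical, not existing Spark APIs):
{code:scala}
import java.io.IOException

// Hypothetical wrapper thrown by lower layers once they have positively identified
// corruption (e.g. a failed checksum or an invalid file footer).
class CorruptFileException(path: String, cause: Throwable)
  extends IOException(s"Corrupt file: $path", cause)

// Under this design, only explicitly identified corruption is skippable; any other
// IOException (e.g. a transient network error) propagates instead of being swallowed.
def shouldSkipCorruptFile(e: Throwable, ignoreCorruptFiles: Boolean): Boolean =
  ignoreCorruptFiles && e.isInstanceOf[CorruptFileException]
{code}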



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39898) Upgrade kubernetes-client to 5.12.3

2022-07-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-39898.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37321
[https://github.com/apache/spark/pull/37321]

> Upgrade kubernetes-client to 5.12.3
> ---
>
> Key: SPARK-39898
> URL: https://issues.apache.org/jira/browse/SPARK-39898
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39898) Upgrade kubernetes-client to 5.12.3

2022-07-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-39898:
-

Assignee: Dongjoon Hyun

> Upgrade kubernetes-client to 5.12.3
> ---
>
> Key: SPARK-39898
> URL: https://issues.apache.org/jira/browse/SPARK-39898
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572125#comment-17572125
 ] 

Apache Spark commented on SPARK-39857:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/37324

> V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
> --
>
> Key: SPARK-39857
> URL: https://issues.apache.org/jira/browse/SPARK-39857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.4.0
>
>
> When building V2 In Predicate in V2ExpressionBuilder, InSet.dataType (which 
> is BooleanType) is used to build the LiteralValue; InSet.child.dataType 
> should be used instead.
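
To illustrate the distinction, for a predicate like {{id IN (1, 2, 3)}} (a hand-written 
example of the intended typing, not the builder's actual code):
{code:scala}
import org.apache.spark.sql.connector.expressions.LiteralValue
import org.apache.spark.sql.types.{BooleanType, IntegerType}

// The IN expression as a whole evaluates to a Boolean, but each literal on its
// right-hand side must carry the element type of the child column.
val wrong = LiteralValue(1, BooleanType)  // what using InSet.dataType produces
val right = LiteralValue(1, IntegerType)  // what InSet.child.dataType gives
{code}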



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572124#comment-17572124
 ] 

Apache Spark commented on SPARK-39857:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/37324

> V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
> --
>
> Key: SPARK-39857
> URL: https://issues.apache.org/jira/browse/SPARK-39857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.4.0
>
>
> When building V2 In Predicate in V2ExpressionBuilder, InSet.dataType (which 
> is BooleanType) is used to build the LiteralValue; InSet.child.dataType 
> should be used instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39900) Issue with querying dataframe produced by 'binaryFile' format using 'not' operator

2022-07-27 Thread Benoit Roy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Roy updated SPARK-39900:
---
Summary: Issue with querying dataframe produced by 'binaryFile' format 
using 'not' operator  (was: Querying dataframe produced by 'binaryFile' format 
using 'not' operator)

> Issue with querying dataframe produced by 'binaryFile' format using 'not' 
> operator
> --
>
> Key: SPARK-39900
> URL: https://issues.apache.org/jira/browse/SPARK-39900
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Benoit Roy
>Priority: Minor
>
> When creating a dataframe using the binaryFile format, I am encountering weird 
> results when filtering/querying with the 'not' operator.
>  
> Here's a repo that will help describe and reproduce the issue.
> [https://github.com/cccs-br/spark-binaryfile-issue]
> {code:java}
> g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
>  
> Here's a very simple test case that illustrates what's going on:
> [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
> TLDR;
> {code:java}
>test("binary file dataframe") {
> // load files in directly into df using 'binaryFile' format.
>     //     
> // - src/test/resources/files/
>     //  - test1.csv
>     //  - test2.json
>     //  - test3.txt
> val df = spark
>   .read
>   .format("binaryFile")
>   .load("src/test/resources/files")
> df.createOrReplaceTempView("files")
> // This works as expected.
> val like_count = spark.sql("select * from files where path like 
> '%.csv'").count()
> assert(like_count === 1)
> // This does not work as expected.
> val not_like_count = spark.sql("select * from files where path not like 
> '%.csv'").count()
> assert(not_like_count === 2)
> // This used to work in 3.2.1
> // df.filter(col("path").endsWith(".csv") === false).show()
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39900) Querying dataframe produced by 'binaryFile' format using 'not' operator

2022-07-27 Thread Benoit Roy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Roy updated SPARK-39900:
---
Description: 
When creating a dataframe using the binaryFile format, I am encountering weird 
results when filtering/querying with the 'not' operator.

 

Here's a repo that will help describe and reproduce the issue.

[https://github.com/cccs-br/spark-binaryfile-issue]
{code:java}
g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
 

Here's a very simple test case that illustrates what's going on:

[https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
{code:java}
   test("binary file dataframe") {
// load files in directly into df using 'binaryFile' format.
    //     
// - src/test/resources/files/
    //  - test1.csv
    //  - test2.json
    //  - test3.txt
val df = spark
  .read
  .format("binaryFile")
  .load("src/test/resources/files")

df.createOrReplaceTempView("files")

// This works as expected.
val like_count = spark.sql("select * from files where path like 
'%.csv'").count()
assert(like_count === 1)

// This does not work as expected.
val not_like_count = spark.sql("select * from files where path not like 
'%.csv'").count()
assert(not_like_count === 2)

// This used to work in 3.2.1
// df.filter(col("path").endsWith(".csv") === false).show()
  }{code}

  was:
When creating a dataframe using the binaryFile format, I am encountering weird 
results when filtering/querying with the 'not' operator.

 

Here's a repo that will help describe and reproduce the issue.

[https://github.com/cccs-br/spark-binaryfile-issue]
{code:java}
g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
 

Here's a very simple test case that illustrates what's going on:

[https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
{code:java}
   test("binary file dataframe") {
// load files in directly into df using 'binaryFile' format.
val df = spark
  .read
  .format("binaryFile")
  .load("src/test/resources/files")

df.createOrReplaceTempView("files")



// This works as expected.
val like_count = spark.sql("select * from files where path like 
'%.csv'").count()
assert(like_count === 1)

// This does not work as expected.
val not_like_count = spark.sql("select * from files where path not like 
'%.csv'").count()
assert(not_like_count === 2)

// This used to work in 3.2.1
// df.filter(col("path").endsWith(".csv") === false).show()
  }{code}


> Querying dataframe produced by 'binaryFile' format using 'not' operator
> ---
>
> Key: SPARK-39900
> URL: https://issues.apache.org/jira/browse/SPARK-39900
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Benoit Roy
>Priority: Minor
>
> When creating a dataframe using the binaryFile format, I am encountering weird 
> results when filtering/querying with the 'not' operator.
>  
> Here's a repo that will help describe and reproduce the issue.
> [https://github.com/cccs-br/spark-binaryfile-issue]
> {code:java}
> g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
>  
> Here's a very simple test case that illustrates what's going on:
> [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
> {code:java}
>test("binary file dataframe") {
> // load files in directly into df using 'binaryFile' format.
>     //     
> // - src/test/resources/files/
>     //  - test1.csv
>     //  - test2.json
>     //  - test3.txt
> val df = spark
>   .read
>   .format("binaryFile")
>   .load("src/test/resources/files")
> df.createOrReplaceTempView("files")
> // This works as expected.
> val like_count = spark.sql("select * from files where path like 
> '%.csv'").count()
> assert(like_count === 1)
> // This does not work as expected.
> val not_like_count = spark.sql("select * from files where path not like 
> '%.csv'").count()
> assert(not_like_count === 2)
> // This used to work in 3.2.1
> // df.filter(col("path").endsWith(".csv") === false).show()
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39900) Querying dataframe produced by 'binaryFile' format using 'not' operator

2022-07-27 Thread Benoit Roy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Roy updated SPARK-39900:
---
Description: 
When creating a dataframe using the binaryFile format, I am encountering weird 
results when filtering/querying with the 'not' operator.

 

Here's a repo that will help describe and reproduce the issue.

[https://github.com/cccs-br/spark-binaryfile-issue]
{code:java}
g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
 

Here's a very simple test case that illustrates what's going on:

[https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]

TLDR;
{code:java}
   test("binary file dataframe") {
// load files in directly into df using 'binaryFile' format.
    //     
// - src/test/resources/files/
    //  - test1.csv
    //  - test2.json
    //  - test3.txt
val df = spark
  .read
  .format("binaryFile")
  .load("src/test/resources/files")

df.createOrReplaceTempView("files")

// This works as expected.
val like_count = spark.sql("select * from files where path like 
'%.csv'").count()
assert(like_count === 1)

// This does not work as expected.
val not_like_count = spark.sql("select * from files where path not like 
'%.csv'").count()
assert(not_like_count === 2)

// This used to work in 3.2.1
// df.filter(col("path").endsWith(".csv") === false).show()
  }{code}

  was:
When creating a dataframe using the binaryFile format, I am encountering weird 
results when filtering/querying with the 'not' operator.

 

Here's a repo that will help describe and reproduce the issue.

[https://github.com/cccs-br/spark-binaryfile-issue]
{code:java}
g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
 

Here's a very simple test case that illustrates what's going on:

[https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
{code:java}
   test("binary file dataframe") {
// load files in directly into df using 'binaryFile' format.
    //     
// - src/test/resources/files/
    //  - test1.csv
    //  - test2.json
    //  - test3.txt
val df = spark
  .read
  .format("binaryFile")
  .load("src/test/resources/files")

df.createOrReplaceTempView("files")

// This works as expected.
val like_count = spark.sql("select * from files where path like 
'%.csv'").count()
assert(like_count === 1)

// This does not work as expected.
val not_like_count = spark.sql("select * from files where path not like 
'%.csv'").count()
assert(not_like_count === 2)

// This used to work in 3.2.1
// df.filter(col("path").endsWith(".csv") === false).show()
  }{code}


> Querying dataframe produced by 'binaryFile' format using 'not' operator
> ---
>
> Key: SPARK-39900
> URL: https://issues.apache.org/jira/browse/SPARK-39900
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Benoit Roy
>Priority: Minor
>
> When creating a dataframe using the binaryFile format, I am encountering weird 
> results when filtering/querying with the 'not' operator.
>  
> Here's a repo that will help describe and reproduce the issue.
> [https://github.com/cccs-br/spark-binaryfile-issue]
> {code:java}
> g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
>  
> Here's a very simple test case that illustrates what's going on:
> [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
> TLDR;
> {code:java}
>test("binary file dataframe") {
> // load files in directly into df using 'binaryFile' format.
>     //     
> // - src/test/resources/files/
>     //  - test1.csv
>     //  - test2.json
>     //  - test3.txt
> val df = spark
>   .read
>   .format("binaryFile")
>   .load("src/test/resources/files")
> df.createOrReplaceTempView("files")
> // This works as expected.
> val like_count = spark.sql("select * from files where path like 
> '%.csv'").count()
> assert(like_count === 1)
> // This does not work as expected.
> val not_like_count = spark.sql("select * from files where path not like 
> '%.csv'").count()
> assert(not_like_count === 2)
> // This used to work in 3.2.1
> // df.filter(col("path").endsWith(".csv") === false).show()
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39900) Querying dataframe produced by 'binaryFile' format using 'not' operator

2022-07-27 Thread Benoit Roy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Roy updated SPARK-39900:
---
Summary: Querying dataframe produced by 'binaryFile' format using 'not' 
operator  (was: Incorrect result when query dataframe produced by 'binaryFile' 
format)

> Querying dataframe produced by 'binaryFile' format using 'not' operator
> ---
>
> Key: SPARK-39900
> URL: https://issues.apache.org/jira/browse/SPARK-39900
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Benoit Roy
>Priority: Minor
>
> When creating a dataframe using the binaryFile format, I am encountering weird 
> results when filtering/querying with the 'not' operator.
>  
> Here's a repo that will help describe and reproduce the issue.
> [https://github.com/cccs-br/spark-binaryfile-issue]
> {code:java}
> g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
>  
> Here's a very simple test case that illustrates what's going on:
> [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
> {code:java}
>test("binary file dataframe") {
> // load files in directly into df using 'binaryFile' format.
> val df = spark
>   .read
>   .format("binaryFile")
>   .load("src/test/resources/files")
> df.createOrReplaceTempView("files")
> // This works as expected.
> val like_count = spark.sql("select * from files where path like 
> '%.csv'").count()
> assert(like_count === 1)
> // This does not work as expected.
> val not_like_count = spark.sql("select * from files where path not like 
> '%.csv'").count()
> assert(not_like_count === 2)
> // This used to work in 3.2.1
> // df.filter(col("path").endsWith(".csv") === false).show()
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39900) Incorrect result when query dataframe produced by 'binaryFile' format

2022-07-27 Thread Benoit Roy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Roy updated SPARK-39900:
---
Description: 
When creating a dataframe using the binaryFile format, I am encountering weird 
results when filtering/querying with the 'not' operator.

 

Here's a repo that will help describe and reproduce the issue.

[https://github.com/cccs-br/spark-binaryfile-issue]
{code:java}
g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
 

Here's a very simple test case that illustrates what's going on:

[https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
{code:java}
   test("binary file dataframe") {
// load files in directly into df using 'binaryFile' format.
val df = spark
  .read
  .format("binaryFile")
  .load("src/test/resources/files")

df.createOrReplaceTempView("files")



// This works as expected.
val like_count = spark.sql("select * from files where path like 
'%.csv'").count()
assert(like_count === 1)

// This does not work as expected.
val not_like_count = spark.sql("select * from files where path not like 
'%.csv'").count()
assert(not_like_count === 2)

// This used to work in 3.2.1
// df.filter(col("path").endsWith(".csv") === false).show()
  }{code}

  was:
When creating a dataframe using the binaryFile format, I am encountering weird 
results when filtering/querying with the 'not' operator.



Here's a repo that will help describe and reproduce the issue.

[https://github.com/cccs-br/spark-binaryfile-issue]

 
{code:java}
g...@github.com:cccs-br/spark-binaryfile-issue.git {code}

Here's a very simple test case that illustrates what's going on:

 

[https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]




{code:java}
   test("binary file dataframe") {
// load files in directly into df using 'binaryFile' format.
val df = spark
  .read
  .format("binaryFile")
  .load("src/test/resources/files")

df.createOrReplaceTempView("files")



// This works as expected.
val like_count = spark.sql("select * from files where path like 
'%.csv'").count()
assert(like_count === 1)

// This does not work as expected.
val not_like_count = spark.sql("select * from files where path not like 
'%.csv'").count()
assert(not_like_count === 2)

// This used to work in 3.2.1
// df.filter(col("path").endsWith(".csv") === false).show()
  }{code}


> Incorrect result when query dataframe produced by 'binaryFile' format
> -
>
> Key: SPARK-39900
> URL: https://issues.apache.org/jira/browse/SPARK-39900
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Benoit Roy
>Priority: Minor
>
> When creating a dataframe using the binaryFile format, I am encountering weird 
> results when filtering/querying with the 'not' operator.
>  
> Here's a repo that will help describe and reproduce the issue.
> [https://github.com/cccs-br/spark-binaryfile-issue]
> {code:java}
> g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
>  
> Here's a very simple test case that illustrates what's going on:
> [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
> {code:java}
>test("binary file dataframe") {
> // load files in directly into df using 'binaryFile' format.
> val df = spark
>   .read
>   .format("binaryFile")
>   .load("src/test/resources/files")
> df.createOrReplaceTempView("files")
> // This works as expected.
> val like_count = spark.sql("select * from files where path like 
> '%.csv'").count()
> assert(like_count === 1)
> // This does not work as expected.
> val not_like_count = spark.sql("select * from files where path not like 
> '%.csv'").count()
> assert(not_like_count === 2)
> // This used to work in 3.2.1
> // df.filter(col("path").endsWith(".csv") === false).show()
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39900) Incorrect result when query dataframe produced by 'binaryFile' format

2022-07-27 Thread Benoit Roy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Roy updated SPARK-39900:
---
Description: 
When creating a dataframe using the binaryFile format, I am encountering weird 
results when filtering/querying with the 'not' operator.



Here's a repo that will help describe and reproduce the issue.

[https://github.com/cccs-br/spark-binaryfile-issue]

 
{code:java}
g...@github.com:cccs-br/spark-binaryfile-issue.git {code}

Here's a very simple test case that illustrates what's going on:

 

[https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]




{code:java}
   test("binary file dataframe") {
// load files in directly into df using 'binaryFile' format.
val df = spark
  .read
  .format("binaryFile")
  .load("src/test/resources/files")

df.createOrReplaceTempView("files")



// This works as expected.
val like_count = spark.sql("select * from files where path like 
'%.csv'").count()
assert(like_count === 1)

// This does not work as expected.
val not_like_count = spark.sql("select * from files where path not like 
'%.csv'").count()
assert(not_like_count === 2)

// This used to work in 3.2.1
// df.filter(col("path").endsWith(".csv") === false).show()
  }{code}

  was:
When creating a dataframe using the binaryFile format, I am encountering weird 
results when filtering/querying with the 'not' operator.

Here's a repo that will help describe and reproduce the issue.

[https://github.com/cccs-br/spark-binaryfile-issue]

```
g...@github.com:cccs-br/spark-binaryfile-issue.git
```

Here's a very simple test case:

https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala


> Incorrect result when query dataframe produced by 'binaryFile' format
> -
>
> Key: SPARK-39900
> URL: https://issues.apache.org/jira/browse/SPARK-39900
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Benoit Roy
>Priority: Minor
>
> When creating a dataframe using the binaryFile format, I am encountering weird 
> results when filtering/querying with the 'not' operator.
> Here's a repo that will help describe and reproduce the issue.
> [https://github.com/cccs-br/spark-binaryfile-issue]
>  
> {code:java}
> g...@github.com:cccs-br/spark-binaryfile-issue.git {code}
> Here's a very simple test case that illustrates what's going on:
>  
> [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala]
> {code:java}
>test("binary file dataframe") {
> // load files in directly into df using 'binaryFile' format.
> val df = spark
>   .read
>   .format("binaryFile")
>   .load("src/test/resources/files")
> df.createOrReplaceTempView("files")
> // This works as expected.
> val like_count = spark.sql("select * from files where path like 
> '%.csv'").count()
> assert(like_count === 1)
> // This does not work as expected.
> val not_like_count = spark.sql("select * from files where path not like 
> '%.csv'").count()
> assert(not_like_count === 2)
> // This used to work in 3.2.1
> // df.filter(col("path").endsWith(".csv") === false).show()
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39900) Incorrect result when query dataframe produced by 'binaryFile' format

2022-07-27 Thread Benoit Roy (Jira)
Benoit Roy created SPARK-39900:
--

 Summary: Incorrect result when query dataframe produced by 
'binaryFile' format
 Key: SPARK-39900
 URL: https://issues.apache.org/jira/browse/SPARK-39900
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0, 3.2.1
Reporter: Benoit Roy


When creating a dataframe using the binaryFile format, I am encountering weird 
results when filtering/querying with the 'not' operator.

Here's a repo that will help describe and reproduce the issue.

[https://github.com/cccs-br/spark-binaryfile-issue]

```
g...@github.com:cccs-br/spark-binaryfile-issue.git
```

Here's a very simple test case:

https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39899) Incorrect passing of message parameters in InvalidUDFClassException

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572096#comment-17572096
 ] 

Apache Spark commented on SPARK-39899:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37323

> Incorrect passing of message parameters in InvalidUDFClassException
> ---
>
> Key: SPARK-39899
> URL: https://issues.apache.org/jira/browse/SPARK-39899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> In fact, messageParameters is not passed to AnalysisException. It is used only 
> to form the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39899) Incorrect passing of message parameters in InvalidUDFClassException

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39899:


Assignee: Apache Spark  (was: Max Gekk)

> Incorrect passing of message parameters in InvalidUDFClassException
> ---
>
> Key: SPARK-39899
> URL: https://issues.apache.org/jira/browse/SPARK-39899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> In fact, messageParameters is not passed to AnalysisException. It is used only 
> to form the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39899) Incorrect passing of message parameters in InvalidUDFClassException

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39899:


Assignee: Max Gekk  (was: Apache Spark)

> Incorrect passing of message parameters in InvalidUDFClassException
> ---
>
> Key: SPARK-39899
> URL: https://issues.apache.org/jira/browse/SPARK-39899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> In fact, messageParameters is not passed to AnalysisException. It is used only 
> to form the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39899) Incorrect passing of message parameters in InvalidUDFClassException

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572095#comment-17572095
 ] 

Apache Spark commented on SPARK-39899:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37323

> Incorrect passing of message parameters in InvalidUDFClassException
> ---
>
> Key: SPARK-39899
> URL: https://issues.apache.org/jira/browse/SPARK-39899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> In fact, messageParameters is not passed to AnalysisException. It is used only 
> to form the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39899) Incorrect passing of message parameters in InvalidUDFClassException

2022-07-27 Thread Max Gekk (Jira)
Max Gekk created SPARK-39899:


 Summary: Incorrect passing of message parameters in 
InvalidUDFClassException
 Key: SPARK-39899
 URL: https://issues.apache.org/jira/browse/SPARK-39899
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk


In fact, messageParameters is not passed to AnalysisException. It is used only to 
form the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39864) ExecutionListenerManager's registration of the ExecutionListenerBus should be lazy

2022-07-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39864:
--
Affects Version/s: 3.4.0
   (was: 2.0.0)

> ExecutionListenerManager's registration of the ExecutionListenerBus should be 
> lazy
> --
>
> Key: SPARK-39864
> URL: https://issues.apache.org/jira/browse/SPARK-39864
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Fix For: 3.4.0
>
>
> Today, ExecutionListenerManager eagerly registers an ExecutionListenerBus 
> SparkListener when it is created, even if the SparkSession has no query 
> execution listeners registered. In applications with many short-lived 
> SparkSessions, this can cause a buildup of empty listeners on the shared 
> listener bus, increasing Spark listener processing times on the driver.
> If we make the registration lazy then we avoid this driver-side listener 
> performance overhead.
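
A simplified sketch of the lazy-registration pattern described above (the names and 
structure are illustrative only, not the actual ExecutionListenerManager code):
{code:scala}
import scala.collection.mutable.ArrayBuffer

// registerBusOnSharedBus stands in for adding the ExecutionListenerBus to the
// shared Spark listener bus; it runs only once, when the first listener arrives.
class LazyListenerRegistration(registerBusOnSharedBus: () => Unit) {
  private val listeners = ArrayBuffer.empty[AnyRef]
  private var busRegistered = false

  def register(listener: AnyRef): Unit = synchronized {
    if (!busRegistered) {
      registerBusOnSharedBus()
      busRegistered = true
    }
    listeners += listener
  }
}
{code}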



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39898) Upgrade kubernetes-client to 5.12.3

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572016#comment-17572016
 ] 

Apache Spark commented on SPARK-39898:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37321

> Upgrade kubernetes-client to 5.12.3
> ---
>
> Key: SPARK-39898
> URL: https://issues.apache.org/jira/browse/SPARK-39898
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39898) Upgrade kubernetes-client to 5.12.3

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572015#comment-17572015
 ] 

Apache Spark commented on SPARK-39898:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37321

> Upgrade kubernetes-client to 5.12.3
> ---
>
> Key: SPARK-39898
> URL: https://issues.apache.org/jira/browse/SPARK-39898
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39845) 0.0 and -0.0 are not consistent in set operations

2022-07-27 Thread Navin Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navin Kumar updated SPARK-39845:

Description: 
This is a continuation of the issue described in SPARK-32110.

When using Array set-based functions {{array_union}}, {{array_intersect}}, 
{{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent 
behavior.

When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use 
{{array_union}} for example with these values directly, {{array(-0.0)}} becomes 
{{array(0.0)}}. See the example below using {{array_union}}:

{code:java}
scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): 
array]
scala> df.collect()
res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)])
{code}

In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the 
arrays produces a single value: {{0.0}}.

However, if I try this operation using a constructed dataframe, these values 
are not equal, and the result is an array with both {{0.0}} and {{-0.0}}.

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("array_union(a, b)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)])
{code}

For {{arrays_overlap}}, here is a similar version of that inconsistency:

{code:java}
scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): 
boolean]

scala> df.collect
res4: Array[org.apache.spark.sql.Row] = Array([true])
{code}

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("arrays_overlap(a, b)")
res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean]

scala> df.selectExpr("arrays_overlap(a, b)").collect
res6: Array[org.apache.spark.sql.Row] = Array([false])
{code}

It looks like this is due to the fact that in the constructed dataframe case, 
the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, 
which will treat {{0.0}} and {{-0.0}} as distinct because of the sign bit.

See here for more information: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L312-L321

I can also confirm that the same behavior occurs with FloatType and the use of 
{{java.lang.Float.floatToIntBits}}
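
The mismatch between value equality and bit-pattern hashing can be seen directly 
(an illustrative snippet, not code from the ticket):
{code:scala}
// 0.0 and -0.0 compare equal, but their IEEE 754 bit patterns differ, so a hash
// set keyed on the raw bits stores them as two distinct entries.
0.0 == -0.0                              // true
java.lang.Double.doubleToLongBits(0.0)   // 0
java.lang.Double.doubleToLongBits(-0.0)  // -9223372036854775808 (only the sign bit set)
{code}
Normalizing -0.0 to 0.0 before hashing would make the hashed path agree with the 
parsed-literal path described above.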

  was:
This is a continuation of the issue described in SPARK-32110.

When using Array set-based functions {{array_union}}, {{array_intersect}}, 
{{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent 
behavior.

When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use 
{{array_union}} for example with these values directly, {{array(-0.0)}} becomes 
{{array(0.0)}}. See the example below using {{array_union}}:

{code:java}
scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): 
array]
scala> df.collect()
res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)])
{code}

In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the 
arrays produces a single value: {{0.0}}.

However, if I try this operation using a constructed dataframe, these values 
are not equal, and the result is an array with both {{0.0}} and {{-0.0}}.

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("array_union(a, b)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)])
{code}

For {{arrays_overlap}}, here is a similar version of that inconsistency:

{code:java}
scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): 
boolean]

scala> df.collect
res4: Array[org.apache.spark.sql.Row] = Array([true])
{code}

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("arrays_overlap(a, b)")
res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean]

scala> df.selectExpr("arrays_overlap(a, b)").collect
res6: Array[org.apache.spark.sql.Row] = Array([false])
{code}

It looks like this is due to the fact that in the constructed dataframe case, 
the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, 
which will treat {{0.0}} and {{-0.0}} as distinct because of the sign bit.

See here for more information: 

[jira] [Assigned] (SPARK-39898) Upgrade kubernetes-client to 5.12.3

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39898:


Assignee: Apache Spark

> Upgrade kubernetes-client to 5.12.3
> ---
>
> Key: SPARK-39898
> URL: https://issues.apache.org/jira/browse/SPARK-39898
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39898) Upgrade kubernetes-client to 5.12.3

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39898:


Assignee: (was: Apache Spark)

> Upgrade kubernetes-client to 5.12.3
> ---
>
> Key: SPARK-39898
> URL: https://issues.apache.org/jira/browse/SPARK-39898
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39845) 0.0 and -0.0 are not consistent in set operations

2022-07-27 Thread Navin Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navin Kumar updated SPARK-39845:

Description: 
This is a continuation of the issue described in SPARK-32110.

When using Array set-based functions {{array_union}}, {{array_intersect}}, 
{{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent 
behavior.

When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use 
{{array_union}} for example with these values directly, {{array(-0.0)}} becomes 
{{array(0.0)}}. See the example below using {{array_union}}:

{code:java}
scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): 
array]
scala> df.collect()
res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)])
{code}

In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the 
arrays produces a single value: {{0.0}}.

However, if I try this operation using a constructed dataframe, these values 
are not equal, and the result is an array with both {{0.0}} and {{-0.0}}.

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("array_union(a, b)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)])
{code}

For {{arrays_overlap}}, here is a similar version of that inconsistency:

{code:java}
scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): 
boolean]

scala> df.collect
res4: Array[org.apache.spark.sql.Row] = Array([true])
{code}

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("arrays_overlap(a, b)")
res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean]

scala> df.selectExpr("arrays_overlap(a, b)").collect
res6: Array[org.apache.spark.sql.Row] = Array([false])
{code}

It looks like this is due to the fact that in the constructed dataframe case, 
the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, 
which will treat {{0.0}} and {{-0.0}} as distinct because of the sign bit.

See here for more information: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L312-L321
 for 

I can also confirm that the same behavior occurs with FloatType and the use of 
{{java.lang.Float.floatToIntBits}}

  was:
This is a continuation of the issue described in SPARK-32110.

When using Array set-based functions {{array_union}}, {{array_intersect}}, 
{{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent 
behavior.

When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use 
{{array_union}} for example with these values directly, {{array(-0.0)}} becomes 
{{array(0.0)}}. See the example below using {{array_union}}:

{code:java}
scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): 
array]
scala> df.collect()
res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)])
{code}

In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the 
arrays produces a single value: {{0.0}}.

However, if I try this operation using a constructed dataframe, these values 
are not equal, and the result is an array with both {{0.0}} and {{-0.0}}.

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("array_union(a, b)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)])
{code}

For {{arrays_overlap}}, here is a similar version of that inconsistency:

{code:java}
scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): 
boolean]

scala> df.collect
res4: Array[org.apache.spark.sql.Row] = Array([true])
{code}

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("arrays_overlap(a, b)")
res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean]

scala> df.selectExpr("arrays_overlap(a, b)").collect
res6: Array[org.apache.spark.sql.Row] = Array([false])
{code}

It looks like this is due to the fact that in the constructed dataframe case, 
the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, 
which will treat {{0.0}} and {{-0.0} as distinct because of the sign bit.

See here for more information: 

[jira] [Updated] (SPARK-39898) Upgrade kubernetes-client to 5.12.3

2022-07-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39898:
--
Component/s: Build

> Upgrade kubernetes-client to 5.12.3
> ---
>
> Key: SPARK-39898
> URL: https://issues.apache.org/jira/browse/SPARK-39898
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39898) Upgrade kubernetes-client to 5.12.3

2022-07-27 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-39898:
-

 Summary: Upgrade kubernetes-client to 5.12.3
 Key: SPARK-39898
 URL: https://issues.apache.org/jira/browse/SPARK-39898
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39885) Behavior differs between arrays_overlap and array_contains for negative 0.0

2022-07-27 Thread David Vogelbacher (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Vogelbacher updated SPARK-39885:
--
Summary: Behavior differs between arrays_overlap and array_contains for 
negative 0.0  (was: Behavior differs between array_overlap and array_contains 
for negative 0.0)

> Behavior differs between arrays_overlap and array_contains for negative 0.0
> ---
>
> Key: SPARK-39885
> URL: https://issues.apache.org/jira/browse/SPARK-39885
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: David Vogelbacher
>Priority: Major
>
> {{array_contains([0.0], -0.0)}} will return true. {{arrays_overlap([0.0], 
> [-0.0])}} will return false. I think we generally want to treat -0.0 and 0.0 
> as the same (see 
> https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala#L28)
> However, the {{Double::equals}} method doesn't. Therefore, we should either 
> mark double as false in 
> [TypeUtils#typeWithProperEquals|https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala#L96],
>  or we should wrap it with our own equals method that handles this case.
> Java code snippets showing the issue:
> {code:java}
> dataset = sparkSession.createDataFrame(
>     List.of(RowFactory.create(List.of(-0.0))),
>     DataTypes.createStructType(ImmutableList.of(DataTypes.createStructField(
>         "doubleCol", DataTypes.createArrayType(DataTypes.DoubleType), false))));
> Dataset<Row> df = dataset.withColumn(
>     "overlaps",
>     functions.arrays_overlap(functions.array(functions.lit(+0.0)),
>         dataset.col("doubleCol")));
> List<Row> result = df.collectAsList(); // [[WrappedArray(-0.0),false]]
> {code}
> {code:java}
> dataset = sparkSession.createDataFrame(
>     List.of(RowFactory.create(-0.0)),
>     DataTypes.createStructType(
>         ImmutableList.of(DataTypes.createStructField("doubleCol",
>             DataTypes.DoubleType, false))));
> Dataset<Row> df = dataset.withColumn(
>     "contains",
>     functions.array_contains(functions.array(functions.lit(+0.0)),
>         dataset.col("doubleCol")));
> List<Row> result = df.collectAsList(); // [[-0.0,true]]
> {code}
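
A minimal sketch of the wrapped-equals option mentioned in the description 
(a hypothetical helper, not Spark's actual implementation):
{code:scala}
// Treats -0.0 and 0.0 as equal while still treating NaN as equal to itself,
// unlike java.lang.Double.equals, which distinguishes the two zeros.
def sqlDoubleEquals(a: Double, b: Double): Boolean =
  (a == 0.0d && b == 0.0d) || java.lang.Double.compare(a, b) == 0
{code}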



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39885) Behavior differs between array_overlap and array_contains for negative 0.0

2022-07-27 Thread David Vogelbacher (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Vogelbacher updated SPARK-39885:
--
Description: 
{{array_contains([0.0], -0.0)}} will return true. {{arrays_overlap([0.0], 
[-0.0])}} will return false. I think we generally want to treat -0.0 and 0.0 as 
the same (see 
https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala#L28)
However, the {{Double::equals}} method doesn't. Therefore, we should either 
mark double as false in 
[TypeUtils#typeWithProperEquals|https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala#L96],
 or we should wrap it with our own equals method that handles this case.

Java code snippets showing the issue:

{code:java}
dataset = sparkSession.createDataFrame(
    List.of(RowFactory.create(List.of(-0.0))),
    DataTypes.createStructType(ImmutableList.of(DataTypes.createStructField(
        "doubleCol", DataTypes.createArrayType(DataTypes.DoubleType), false))));
Dataset<Row> df = dataset.withColumn(
    "overlaps",
    functions.arrays_overlap(functions.array(functions.lit(+0.0)),
        dataset.col("doubleCol")));
List<Row> result = df.collectAsList(); // [[WrappedArray(-0.0),false]]
{code}

{code:java}
dataset = sparkSession.createDataFrame(
    List.of(RowFactory.create(-0.0)),
    DataTypes.createStructType(
        ImmutableList.of(DataTypes.createStructField("doubleCol",
            DataTypes.DoubleType, false))));
Dataset<Row> df = dataset.withColumn(
    "contains",
    functions.array_contains(functions.array(functions.lit(+0.0)),
        dataset.col("doubleCol")));
List<Row> result = df.collectAsList(); // [[-0.0,true]]
{code}


  was:
{{array_contains([0.0], -0.0)}} will return true. {{arrays_overlap([0.0], 
[-0.0])}} will return false. I think we generally want to treat -0.0 and 0.0 as 
the same (see 
https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala#L28)
However, the {{Double::equals}} method doesn't. Therefore, we should either 
mark double as false in 
[TypeUtils#typeWithProperEquals|https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala#L96],
 or we should wrap it with our own equals method that handles this case.

Java code snippets showing the issue:

{code:java}
dataset = sparkSession.createDataFrame(
List.of(RowFactory.create(List.of(-0.0))),

DataTypes.createStructType(ImmutableList.of(DataTypes.createStructField(
"doubleCol", 
DataTypes.createArrayType(DataTypes.DoubleType), false;
Dataset df = dataset.withColumn(
"overlaps", 
functions.arrays_overlap(functions.array(functions.lit(+0.0)), 
dataset.col("doubleCol")));
List result = df.collectAsList(); // [[WrappedArray(-0.0),false]]
{code}

{code:java}
dataset = sparkSession.createDataFrame(
List.of(RowFactory.create(-0.0)),
DataTypes.createStructType(

ImmutableList.of(DataTypes.createStructField("doubleCol", DataTypes.DoubleType, 
false;
Dataset df = dataset.withColumn(
"overlaps", 
functions.array_contains(functions.array(functions.lit(+0.0)), 
dataset.col("doubleCol")));
List result = df.collectAsList(); // [[-0.0,true]]
{code}



> Behavior differs between array_overlap and array_contains for negative 0.0
> --
>
> Key: SPARK-39885
> URL: https://issues.apache.org/jira/browse/SPARK-39885
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: David Vogelbacher
>Priority: Major
>
> {{array_contains([0.0], -0.0)}} will return true. {{arrays_overlap([0.0], 
> [-0.0])}} will return false. I think we generally want to treat -0.0 and 0.0 
> as the same (see 
> https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala#L28)
> However, the {{Double::equals}} method doesn't. Therefore, we should either 
> mark double as false in 
> [TypeUtils#typeWithProperEquals|https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala#L96],
>  or we should wrap it with our own equals method that handles this case.
> Java code snippets showing the issue:
> 

[jira] [Created] (SPARK-39897) StackOverflowError in TaskMemoryManager

2022-07-27 Thread Andrew Ray (Jira)
Andrew Ray created SPARK-39897:
--

 Summary: StackOverflowError in TaskMemoryManager
 Key: SPARK-39897
 URL: https://issues.apache.org/jira/browse/SPARK-39897
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.7
Reporter: Andrew Ray


I have observed the following error, which looks to stem from 
TaskMemoryManager.allocatePage making a recursive call to itself when a page 
cannot be allocated. I'm observing this in Spark 2.4, but since the relevant 
code is still the same in master, this is likely still a potential point of 
failure in current versions. I'm prioritizing this as minor because it looks to be 
a very uncommon outcome and I cannot find any other reports of a similar nature.
{code:java}
Py4JJavaError: An error occurred while calling o625.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
at 
org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503)
at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217)
at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:177)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at 
org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474)
at 
org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453)
at 
org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.StackOverflowError
at 
java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1012)
at 
java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:1535)
at java.lang.ClassLoader.getClassLoadingLock(ClassLoader.java:457)
at java.lang.ClassLoader.loadClass(ClassLoader.java:398)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.util.ResourceBundle$RBClassLoader.loadClass(ResourceBundle.java:512)
at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2657)
at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1518)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1482)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1436)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1436)
at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1370)
at java.util.ResourceBundle.getBundle(ResourceBundle.java:899)
at sun.util.resources.LocaleData$1.run(LocaleData.java:167)

[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files

2022-07-27 Thread shezm (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571995#comment-17571995
 ] 

shezm commented on SPARK-39743:
---

[~yeachan153] 

{{spark.io.compression.zstd.level}} applies to 
{{spark.io.compression.codec}}; it only affects Spark's internal data, not Parquet output.

If you want to set a different zstd level when writing parquet files, you can set 
{{parquet.compression.codec.zstd.level}} in the SparkConf, for example:

 
{code:java}
val spark = SparkSession
      .builder()
      .master("local")
      .appName("spark example")
      .config("spark.sql.parquet.compression.codec", "zstd")
      .config("parquet.compression.codec.zstd.level", 10)  // here 
      .getOrCreate(); 

val csvfile = spark.read.csv("file:///home/test_data/Reviews.csv")
csvfile.coalesce(1).write.parquet("file:///home/test_data/nn_parq_10"){code}
 

 

> Unable to set zstd compression level while writing parquet files
> 
>
> Key: SPARK-39743
> URL: https://issues.apache.org/jira/browse/SPARK-39743
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yeachan Park
>Priority: Minor
>
> While writing zstd-compressed parquet files, the setting 
> `spark.io.compression.zstd.level` does not have any effect on 
> the zstd compression level.
> All files seem to be written with the default zstd compression level, and the 
> config option seems to be ignored.
> Using the zstd cli tool, we confirmed that setting a higher compression level 
> for the same file tested in Spark resulted in a smaller file.
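A rough way to sanity-check the effect of the level (a sketch with hypothetical output paths, assuming an active {{spark}} session): write the same data with two different values of {{parquet.compression.codec.zstd.level}} and compare the on-disk sizes.

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Sum the sizes of the files in an output directory.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
def dirSize(path: String): Long =
  fs.listStatus(new Path(path)).map(_.getLen).sum

// Hypothetical outputs, e.g. one written with level 3 and one with level 19.
println(dirSize("file:///tmp/out_zstd_level_3"))
println(dirSize("file:///tmp/out_zstd_level_19"))
{code}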



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark.

2022-07-27 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571978#comment-17571978
 ] 

pralabhkumar edited comment on SPARK-39375 at 7/27/22 2:51 PM:
---

This is a really good proposal and the need of the hour (specifically since Livy is 
dormant and Toree is also not very active). This will hugely help with use cases 
related to notebooks.

Please let us know whether there is an ETA for the first version, or whether there is a 
plan to add further sub-tasks so that other people can contribute.


was (Author: pralabhkumar):
This is a really good proposal and the need of the hour (specifically since Livy is 
dormant and Toree is also not very active). This will hugely help with use cases 
related to notebooks.

Please let us know whether there is an ETA for the first version, or whether there is a 
plan to add further tasks so that other people can contribute.

> SPIP: Spark Connect - A client and server interface for Apache Spark.
> -
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Martin Grund
>Priority: Major
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and the fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to  remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark.

2022-07-27 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571978#comment-17571978
 ] 

pralabhkumar commented on SPARK-39375:
--

This is a really good proposal and the need of the hour (specifically since Livy is 
dormant and Toree is also not very active). This will hugely help with use cases 
related to notebooks.

Please let us know whether there is an ETA for the first version, or whether there is a 
plan to add further tasks so that other people can contribute.

> SPIP: Spark Connect - A client and server interface for Apache Spark.
> -
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Martin Grund
>Priority: Major
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and the fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to  remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison

2022-07-27 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571950#comment-17571950
 ] 

Yuming Wang commented on SPARK-39896:
-

cc [~fchen]

> The structural integrity of the plan is broken after 
> UnwrapCastInBinaryComparison
> -
>
> Key: SPARK-39896
> URL: https://issues.apache.org/jira/browse/SPARK-39896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> sql("create table t1(a decimal(3, 0)) using parquet")
> sql("insert into t1 values(100), (10), (1)")
> sql("select * from t1 where a in(10, 10, 0, 1.00)").show
> {code}
> {noformat}
> After applying rule 
> org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
> Operator Optimization before Inferring Filters, the structural integrity of 
> the plan is broken.
> java.lang.RuntimeException: After applying rule 
> org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
> Operator Optimization before Inferring Filters, the structural integrity of 
> the plan is broken.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison

2022-07-27 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-39896:
---

 Summary: The structural integrity of the plan is broken after 
UnwrapCastInBinaryComparison
 Key: SPARK-39896
 URL: https://issues.apache.org/jira/browse/SPARK-39896
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yuming Wang


{code:scala}
sql("create table t1(a decimal(3, 0)) using parquet")
sql("insert into t1 values(100), (10), (1)")
sql("select * from t1 where a in(10, 10, 0, 1.00)").show
{code}

{noformat}
After applying rule 
org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
Operator Optimization before Inferring Filters, the structural integrity of the 
plan is broken.
java.lang.RuntimeException: After applying rule 
org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
Operator Optimization before Inferring Filters, the structural integrity of the 
plan is broken.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
{noformat}





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39880) V2 SHOW FUNCTIONS command should print qualified function name like v1

2022-07-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-39880:


Assignee: Wenchen Fan

> V2 SHOW FUNCTIONS command should print qualified function name like v1
> --
>
> Key: SPARK-39880
> URL: https://issues.apache.org/jira/browse/SPARK-39880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39880) V2 SHOW FUNCTIONS command should print qualified function name like v1

2022-07-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-39880.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37301
[https://github.com/apache/spark/pull/37301]

> V2 SHOW FUNCTIONS command should print qualified function name like v1
> --
>
> Key: SPARK-39880
> URL: https://issues.apache.org/jira/browse/SPARK-39880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39819) DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571871#comment-17571871
 ] 

Apache Spark commented on SPARK-39819:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37320

> DS V2 aggregate push down can work with Top N or Paging (Sort with group 
> expressions)
> -
>
> Key: SPARK-39819
> URL: https://issues.apache.org/jira/browse/SPARK-39819
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, DS V2 aggregate push-down cannot work with Top N (order by ... 
> limit ...) or Paging (order by ... limit ... offset ...).
> If it could work with Top N or Paging, performance would be better.
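An illustrative query shape (assuming an active {{spark}} session; the table and column names are hypothetical) where the sort/limit sits on top of a group-by aggregate over a V2 source:

{code:scala}
// An aggregate followed by ORDER BY ... LIMIT on the grouping column: per this ticket,
// the Top N on top is not currently pushed down to the source together with the aggregate.
val topN = spark.sql(
  """SELECT dept, MAX(salary) AS max_salary
    |FROM employees
    |GROUP BY dept
    |ORDER BY dept
    |LIMIT 10""".stripMargin)
topN.explain()
{code}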



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39887) Expression transform error

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571865#comment-17571865
 ] 

Apache Spark commented on SPARK-39887:
--

User 'cfmcgrady' has created a pull request for this issue:
https://github.com/apache/spark/pull/37319

> Expression transform error
> --
>
> Key: SPARK-39887
> URL: https://issues.apache.org/jira/browse/SPARK-39887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.2.2
>Reporter: zhuml
>Priority: Major
>
> {code:java}
> spark.sql(
>   """
> |select to_date(a) a, to_date(b) b from
> |(select  a, a as b from
> |(select to_date(a) a from
> | values ('2020-02-01') as t1(a)
> | group by to_date(a)) t3
> |union all
> |select a, b from
> |(select to_date(a) a, to_date(b) b from
> |values ('2020-01-01','2020-01-02') as t1(a, b)
> | group by to_date(a), to_date(b)) t4) t5
> |group by to_date(a), to_date(b)
> |""".stripMargin).show(){code}
> result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01)
> expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39887) Expression transform error

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39887:


Assignee: Apache Spark

> Expression transform error
> --
>
> Key: SPARK-39887
> URL: https://issues.apache.org/jira/browse/SPARK-39887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.2.2
>Reporter: zhuml
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> spark.sql(
>   """
> |select to_date(a) a, to_date(b) b from
> |(select  a, a as b from
> |(select to_date(a) a from
> | values ('2020-02-01') as t1(a)
> | group by to_date(a)) t3
> |union all
> |select a, b from
> |(select to_date(a) a, to_date(b) b from
> |values ('2020-01-01','2020-01-02') as t1(a, b)
> | group by to_date(a), to_date(b)) t4) t5
> |group by to_date(a), to_date(b)
> |""".stripMargin).show(){code}
> result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01)
> expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39887) Expression transform error

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39887:


Assignee: (was: Apache Spark)

> Expression transform error
> --
>
> Key: SPARK-39887
> URL: https://issues.apache.org/jira/browse/SPARK-39887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.2.2
>Reporter: zhuml
>Priority: Major
>
> {code:java}
> spark.sql(
>   """
> |select to_date(a) a, to_date(b) b from
> |(select  a, a as b from
> |(select to_date(a) a from
> | values ('2020-02-01') as t1(a)
> | group by to_date(a)) t3
> |union all
> |select a, b from
> |(select to_date(a) a, to_date(b) b from
> |values ('2020-01-01','2020-01-02') as t1(a, b)
> | group by to_date(a), to_date(b)) t4) t5
> |group by to_date(a), to_date(b)
> |""".stripMargin).show(){code}
> result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01)
> expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39887) Expression transform error

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571864#comment-17571864
 ] 

Apache Spark commented on SPARK-39887:
--

User 'cfmcgrady' has created a pull request for this issue:
https://github.com/apache/spark/pull/37319

> Expression transform error
> --
>
> Key: SPARK-39887
> URL: https://issues.apache.org/jira/browse/SPARK-39887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.2.2
>Reporter: zhuml
>Priority: Major
>
> {code:java}
> spark.sql(
>   """
> |select to_date(a) a, to_date(b) b from
> |(select  a, a as b from
> |(select to_date(a) a from
> | values ('2020-02-01') as t1(a)
> | group by to_date(a)) t3
> |union all
> |select a, b from
> |(select to_date(a) a, to_date(b) b from
> |values ('2020-01-01','2020-01-02') as t1(a, b)
> | group by to_date(a), to_date(b)) t4) t5
> |group by to_date(a), to_date(b)
> |""".stripMargin).show(){code}
> result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01)
> expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39819) DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-27 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-39819:
---
Summary: DS V2 aggregate push down can work with Top N or Paging (Sort with 
group expressions)  (was: DS V2 aggregate push down can work with Top N or 
Paging (Sort with group column))

> DS V2 aggregate push down can work with Top N or Paging (Sort with group 
> expressions)
> -
>
> Key: SPARK-39819
> URL: https://issues.apache.org/jira/browse/SPARK-39819
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, DS V2 aggregate push-down cannot work with Top N (order by ... 
> limit ...) or Paging (order by ... limit ... offset ...).
> If it could work with Top N or Paging, performance would be better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39890) Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39890:


Assignee: (was: Apache Spark)

> Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering
> ---
>
> Key: SPARK-39890
> URL: https://issues.apache.org/jira/browse/SPARK-39890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Minor
>
> AliasAwareOutputOrdering can save a sort if the project inside 
> TakeOrderedAndProjectExec has an alias for the sort order.
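A hypothetical example of the pattern (assuming an active {{spark}} session and an existing table {{t}}): the projection renames the sort column, so an alias-aware output ordering lets the ordering by {{id}} carry over to {{renamed_id}}.

{code:scala}
// ORDER BY id LIMIT 5 with a projection that aliases id: the physical plan uses
// TakeOrderedAndProjectExec, and recognizing the alias lets its output be reported
// as sorted by renamed_id, so a downstream operator needing that order can skip a sort.
val q = spark.sql(
  """SELECT id AS renamed_id
    |FROM t
    |ORDER BY id
    |LIMIT 5""".stripMargin)
q.explain()
{code}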



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39890) Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571831#comment-17571831
 ] 

Apache Spark commented on SPARK-39890:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37318

> Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering
> ---
>
> Key: SPARK-39890
> URL: https://issues.apache.org/jira/browse/SPARK-39890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Minor
>
> AliasAwareOutputOrdering can save a sort if the project inside 
> TakeOrderedAndProjectExec has an alias for the sort order.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39890) Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39890:


Assignee: Apache Spark

> Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering
> ---
>
> Key: SPARK-39890
> URL: https://issues.apache.org/jira/browse/SPARK-39890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Minor
>
> AliasAwareOutputOrdering can save a sort if the project inside 
> TakeOrderedAndProjectExec has an alias for the sort order.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39895) pyspark drop doesn't accept *cols

2022-07-27 Thread Santosh Pingale (Jira)
Santosh Pingale created SPARK-39895:
---

 Summary: pyspark drop doesn't accept *cols 
 Key: SPARK-39895
 URL: https://issues.apache.org/jira/browse/SPARK-39895
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.2, 3.3.0, 3.0.3
Reporter: Santosh Pingale


Pyspark dataframe drop has the following signature:

{{def drop(self, *cols: "ColumnOrName") -> "DataFrame":}}

However, when we try to pass multiple Column types to the drop function, it raises a 
TypeError:

{{each col in the param list should be a string}}

*Minimal reproducible example:*
values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), ("id_1", 3, 3), ("id_2", 4, 3)]
df = spark.createDataFrame(values, "id string, point int, count int")
 |-- id: string (nullable = true)
 |-- point: integer (nullable = true)
 |-- count: integer (nullable = true)

{{df.drop(df.point, df.count)}}
{quote}/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py in drop(self, *cols)
2537 for col in cols:
2538 if not isinstance(col, str):
-> 2539 raise TypeError("each col in the param list should be a string")
2540 jdf = self._jdf.drop(self._jseq(cols))
2541

TypeError: each col in the param list should be a string
{quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39894) Combine the similar binary comparison in boolean expression.

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571817#comment-17571817
 ] 

Apache Spark commented on SPARK-39894:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37317

> Combine the similar binary comparison in boolean expression.
> 
>
> Key: SPARK-39894
> URL: https://issues.apache.org/jira/browse/SPARK-39894
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> If a boolean expression has two similar binary comparisons connected with And, 
> e.g. 'a > 1 and 'a > 2, we should simplify them to 'a > 2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39894) Combine the similar binary comparison in boolean expression.

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39894:


Assignee: Apache Spark

> Combine the similar binary comparison in boolean expression.
> 
>
> Key: SPARK-39894
> URL: https://issues.apache.org/jira/browse/SPARK-39894
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> If a boolean expression has two similar binary comparisons connected with And, 
> e.g. 'a > 1 and 'a > 2, we should simplify them to 'a > 2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39894) Combine the similar binary comparison in boolean expression.

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571816#comment-17571816
 ] 

Apache Spark commented on SPARK-39894:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37317

> Combine the similar binary comparison in boolean expression.
> 
>
> Key: SPARK-39894
> URL: https://issues.apache.org/jira/browse/SPARK-39894
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> If a boolean expression has two similar binary comparisons connected with And, 
> e.g. 'a > 1 and 'a > 2, we should simplify them to 'a > 2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39894) Combine the similar binary comparison in boolean expression.

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39894:


Assignee: (was: Apache Spark)

> Combine the similar binary comparison in boolean expression.
> 
>
> Key: SPARK-39894
> URL: https://issues.apache.org/jira/browse/SPARK-39894
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> If a boolean expression has two similar binary comparisons connected with And, 
> e.g. 'a > 1 and 'a > 2, we should simplify them to 'a > 2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39731) Correctness issue when parsing dates with yyyyMMdd format in CSV and JSON

2022-07-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39731:
---

Assignee: Ivan Sadikov

> Correctness issue when parsing dates with MMdd format in CSV and JSON
> -
>
> Key: SPARK-39731
> URL: https://issues.apache.org/jira/browse/SPARK-39731
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
>
> In Spark 3.x, when reading CSV data like this:
> {code:java}
> name,mydate
> 1,2020011
> 2,20201203{code}
> and specifying the date pattern as "yyyyMMdd", dates are not parsed correctly 
> with the CORRECTED time parser policy.
> For example,
> {code:java}
> val df = spark.read.schema("name string, mydate date").option("dateFormat", 
> "yyyyMMdd").option("header", "true").csv("file:/tmp/test.csv")
> df.show(false){code}
> Returns:
> {code:java}
> +----+--------------+
> |name|mydate        |
> +----+--------------+
> |1   |+2020011-01-01|
> |2   |2020-12-03    |
> +----+--------------+ {code}
> and it used to return null instead of the invalid date in Spark 3.2 or below.
>  
> The issue appears to be caused by this PR: 
> [https://github.com/apache/spark/pull/32959].
>  
> A similar issue can be observed in the JSON data source.
> test.json
> {code:java}
> {"date": "2020011"}
> {"date": "20201203"} {code}
>  
> Running commands
> {code:java}
> val df = spark.read.schema("date date").option("dateFormat", 
> "yyyyMMdd").json("file:/tmp/test.json")
> df.show(false) {code}
> returns
> {code:java}
> +--------------+
> |date          |
> +--------------+
> |+2020011-01-01|
> |2020-12-03    |
> +--------------+{code}
> but before the patch linked in the description it used to show:
> {code:java}
> +----------+
> |date      |
> +----------+
> |7500-08-09|
> |2020-12-03|
> +----------+{code}
> which is strange either way. I will try to address it in the PR.
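For comparison, a small sketch using the plain Java time API (not Spark's parser) of what strict yyyyMMdd parsing looks like: the 8-digit value parses, while the 7-digit value should be rejected rather than turned into +2020011-01-01.

{code:scala}
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, ResolverStyle}

// Strict fixed-width pattern: month and day each take two digits, the year takes the rest.
val fmt = DateTimeFormatter.ofPattern("uuuuMMdd").withResolverStyle(ResolverStyle.STRICT)

println(LocalDate.parse("20201203", fmt)) // 2020-12-03
// LocalDate.parse("2020011", fmt)        // should fail with a DateTimeParseException
{code}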



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39731) Correctness issue when parsing dates with yyyyMMdd format in CSV and JSON

2022-07-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39731.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37147
[https://github.com/apache/spark/pull/37147]

> Correctness issue when parsing dates with MMdd format in CSV and JSON
> -
>
> Key: SPARK-39731
> URL: https://issues.apache.org/jira/browse/SPARK-39731
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.4.0
>
>
> In Spark 3.x, when reading CSV data like this:
> {code:java}
> name,mydate
> 1,2020011
> 2,20201203{code}
> and specifying the date pattern as "yyyyMMdd", dates are not parsed correctly 
> with the CORRECTED time parser policy.
> For example,
> {code:java}
> val df = spark.read.schema("name string, mydate date").option("dateFormat", 
> "yyyyMMdd").option("header", "true").csv("file:/tmp/test.csv")
> df.show(false){code}
> Returns:
> {code:java}
> +----+--------------+
> |name|mydate        |
> +----+--------------+
> |1   |+2020011-01-01|
> |2   |2020-12-03    |
> +----+--------------+ {code}
> and it used to return null instead of the invalid date in Spark 3.2 or below.
>  
> The issue appears to be caused by this PR: 
> [https://github.com/apache/spark/pull/32959].
>  
> A similar issue can be observed in the JSON data source.
> test.json
> {code:java}
> {"date": "2020011"}
> {"date": "20201203"} {code}
>  
> Running commands
> {code:java}
> val df = spark.read.schema("date date").option("dateFormat", 
> "yyyyMMdd").json("file:/tmp/test.json")
> df.show(false) {code}
> returns
> {code:java}
> +--------------+
> |date          |
> +--------------+
> |+2020011-01-01|
> |2020-12-03    |
> +--------------+{code}
> but before the patch linked in the description it used to show:
> {code:java}
> +----------+
> |date      |
> +----------+
> |7500-08-09|
> |2020-12-03|
> +----------+{code}
> which is strange either way. I will try to address it in the PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39893) Remove redundant aggregate if it is group only and all grouping and aggregate expressions are foldable

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39893:


Assignee: Apache Spark

> Remove redundant aggregate if it is group only and all grouping and aggregate 
> expressions are foldable
> --
>
> Key: SPARK-39893
> URL: https://issues.apache.org/jira/browse/SPARK-39893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Assignee: Apache Spark
>Priority: Major
>
> If all groupingExpressions and aggregateExpressions in an aggregate are 
> foldable, we can remove this aggregate.
> For example, for the query:
> {code:java}
> SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData
> {code}
> the grouping expressions are *[1001, 2022-06-03]*
> and the aggregate expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*,
> so we can skip scanning table testData and remove the aggregate operation.
> Before this PR:
> {code:java}
> Aggregate [1001, 2022-06-03], [1001 AS id#274, 2022-06-03 AS DT#275], 
> Statistics(sizeInBytes=16.0 EiB)
> +- SerializeFromObject, Statistics(sizeInBytes=8.0 EiB)
>+- ExternalRDD [obj#12], Statistics(sizeInBytes=8.0 EiB)
> {code}
> After this PR:
> {code:java}
> Project [1001 AS id#218, 2022-06-03 AS DT#219], Statistics(sizeInBytes=2.0 B)
> +- OneRowRelation, Statistics(sizeInBytes=1.0 B)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39893) Remove redundant aggregate if it is group only and all grouping and aggregate expressions are foldable

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571804#comment-17571804
 ] 

Apache Spark commented on SPARK-39893:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/37316

> Remove redundant aggregate if it is group only and all grouping and aggregate 
> expressions are foldable
> --
>
> Key: SPARK-39893
> URL: https://issues.apache.org/jira/browse/SPARK-39893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Major
>
> If all groupingExpressions and aggregateExpressions in an aggregate are 
> foldable, we can remove this aggregate.
> For example, for the query:
> {code:java}
> SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData
> {code}
> the grouping expressions are *[1001, 2022-06-03]*
> and the aggregate expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*,
> so we can skip scanning table testData and remove the aggregate operation.
> Before this PR:
> {code:java}
> Aggregate [1001, 2022-06-03], [1001 AS id#274, 2022-06-03 AS DT#275], 
> Statistics(sizeInBytes=16.0 EiB)
> +- SerializeFromObject, Statistics(sizeInBytes=8.0 EiB)
>+- ExternalRDD [obj#12], Statistics(sizeInBytes=8.0 EiB)
> {code}
> After this PR:
> {code:java}
> Project [1001 AS id#218, 2022-06-03 AS DT#219], Statistics(sizeInBytes=2.0 B)
> +- OneRowRelation, Statistics(sizeInBytes=1.0 B)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39893) Remove redundant aggregate if it is group only and all grouping and aggregate expressions are foldable

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39893:


Assignee: (was: Apache Spark)

> Remove redundant aggregate if it is group only and all grouping and aggregate 
> expressions are foldable
> --
>
> Key: SPARK-39893
> URL: https://issues.apache.org/jira/browse/SPARK-39893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Major
>
> If all groupingExpressions and aggregateExpressions in an aggregate are 
> foldable, we can remove this aggregate.
> For example, for the query:
> {code:java}
> SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData
> {code}
> the grouping expressions are *[1001, 2022-06-03]*
> and the aggregate expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*,
> so we can skip scanning table testData and remove the aggregate operation.
> Before this PR:
> {code:java}
> Aggregate [1001, 2022-06-03], [1001 AS id#274, 2022-06-03 AS DT#275], 
> Statistics(sizeInBytes=16.0 EiB)
> +- SerializeFromObject, Statistics(sizeInBytes=8.0 EiB)
>+- ExternalRDD [obj#12], Statistics(sizeInBytes=8.0 EiB)
> {code}
> After this PR:
> {code:java}
> Project [1001 AS id#218, 2022-06-03 AS DT#219], Statistics(sizeInBytes=2.0 B)
> +- OneRowRelation, Statistics(sizeInBytes=1.0 B)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39893) Remove redundant aggregate if it is group only and all grouping and aggregate expressions are foldable

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571803#comment-17571803
 ] 

Apache Spark commented on SPARK-39893:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/37316

> Remove redundant aggregate if it is group only and all grouping and aggregate 
> expressions are foldable
> --
>
> Key: SPARK-39893
> URL: https://issues.apache.org/jira/browse/SPARK-39893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Major
>
> If all groupingExpressions and aggregateExpressions in an aggregate are 
> foldable, we can remove this aggregate.
> For example, for the query:
> {code:java}
> SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData
> {code}
> the grouping expressions are *[1001, 2022-06-03]*
> and the aggregate expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*,
> so we can skip scanning table testData and remove the aggregate operation.
> Before this PR:
> {code:java}
> Aggregate [1001, 2022-06-03], [1001 AS id#274, 2022-06-03 AS DT#275], 
> Statistics(sizeInBytes=16.0 EiB)
> +- SerializeFromObject, Statistics(sizeInBytes=8.0 EiB)
>+- ExternalRDD [obj#12], Statistics(sizeInBytes=8.0 EiB)
> {code}
> After this PR:
> {code:java}
> Project [1001 AS id#218, 2022-06-03 AS DT#219], Statistics(sizeInBytes=2.0 B)
> +- OneRowRelation, Statistics(sizeInBytes=1.0 B)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39873) Remove OptimizeLimitZero and merge it into EliminateLimits

2022-07-27 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-39873:
---
Affects Version/s: 3.4.0
   (was: 2.4.0)

> Remove OptimizeLimitZero and merge it into EliminateLimits
> --
>
> Key: SPARK-39873
> URL: https://issues.apache.org/jira/browse/SPARK-39873
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, the Spark optimizer has two rules: OptimizeLimitZero and 
> EliminateLimits.
> In fact, OptimizeLimitZero is just one case of eliminating limits.
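For illustration (a trivial sketch, assuming an active {{spark}} session), LIMIT 0 is just the degenerate case: the result is known to be empty without evaluating the child, which is why a single limit-elimination rule can cover it.

{code:scala}
// LIMIT 0 can be folded to an empty relation; no rows from the child are needed.
val empty = spark.range(100).limit(0)
println(empty.count()) // 0
{code}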



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39894) Combine the similar binary comparison in boolean expression.

2022-07-27 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-39894:
--

 Summary: Combine the similar binary comparison in boolean 
expression.
 Key: SPARK-39894
 URL: https://issues.apache.org/jira/browse/SPARK-39894
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng


If a boolean expression has two similar binary comparisons connected with And, 
e.g. 'a > 1 and 'a > 2, we should simplify them to 'a > 2.
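A toy sketch on a hypothetical mini-AST (not Catalyst) of the intended rewrite: two comparisons on the same attribute joined by And collapse to the stricter bound.

{code:scala}
// Toy expression tree, just to show the shape of the simplification.
sealed trait Expr
case class Attr(name: String) extends Expr
case class GreaterThan(attr: Attr, value: Int) extends Expr
case class And(left: Expr, right: Expr) extends Expr

def combineSimilarComparisons(e: Expr): Expr = e match {
  // a > v1 AND a > v2  =>  a > max(v1, v2)
  case And(GreaterThan(a1, v1), GreaterThan(a2, v2)) if a1 == a2 =>
    GreaterThan(a1, math.max(v1, v2))
  case other => other
}

// combineSimilarComparisons(And(GreaterThan(Attr("a"), 1), GreaterThan(Attr("a"), 2)))
// => GreaterThan(Attr(a), 2)
{code}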



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39893) Remove redundant aggregate if it is group only and all grouping and aggregate expressions are foldable

2022-07-27 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-39893:

Summary: Remove redundant aggregate if it is group only and all grouping 
and aggregate expressions are foldable  (was: Remove Aggregate if it is group 
only and all grouping and aggregate expressions are foldable)

> Remove redundant aggregate if it is group only and all grouping and aggregate 
> expressions are foldable
> --
>
> Key: SPARK-39893
> URL: https://issues.apache.org/jira/browse/SPARK-39893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Major
>
> If all groupingExpressions and aggregateExpressions in an aggregate are 
> foldable, we can remove this aggregate.
> For example, for the query:
> {code:java}
> SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData
> {code}
> the grouping expressions are *[1001, 2022-06-03]*
> and the aggregate expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*,
> so we can skip scanning table testData and remove the aggregate operation.
> Before this PR:
> {code:java}
> Aggregate [1001, 2022-06-03], [1001 AS id#274, 2022-06-03 AS DT#275], 
> Statistics(sizeInBytes=16.0 EiB)
> +- SerializeFromObject, Statistics(sizeInBytes=8.0 EiB)
>+- ExternalRDD [obj#12], Statistics(sizeInBytes=8.0 EiB)
> {code}
> After this PR:
> {code:java}
> Project [1001 AS id#218, 2022-06-03 AS DT#219], Statistics(sizeInBytes=2.0 B)
> +- OneRowRelation, Statistics(sizeInBytes=1.0 B)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39893) Remove Aggregate if it is group only and all grouping and aggregate expressions are foldable

2022-07-27 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-39893:

Summary: Remove Aggregate if it is group only and all grouping and 
aggregate expressions are foldable  (was: Remote Aggregate if it is group only 
and all grouping and aggregate expressions are foldable)

> Remove Aggregate if it is group only and all grouping and aggregate 
> expressions are foldable
> 
>
> Key: SPARK-39893
> URL: https://issues.apache.org/jira/browse/SPARK-39893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Major
>
> If all groupingExpressions and aggregateExpressions in an aggregate are 
> foldable, we can remove this aggregate.
> For example, for the query:
> {code:java}
> SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData
> {code}
> the grouping expressions are *[1001, 2022-06-03]*
> and the aggregate expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*,
> so we can skip scanning table testData and remove the aggregate operation.
> Before this PR:
> {code:java}
> Aggregate [1001, 2022-06-03], [1001 AS id#274, 2022-06-03 AS DT#275], 
> Statistics(sizeInBytes=16.0 EiB)
> +- SerializeFromObject, Statistics(sizeInBytes=8.0 EiB)
>+- ExternalRDD [obj#12], Statistics(sizeInBytes=8.0 EiB)
> {code}
> After this PR:
> {code:java}
> Project [1001 AS id#218, 2022-06-03 AS DT#219], Statistics(sizeInBytes=2.0 B)
> +- OneRowRelation, Statistics(sizeInBytes=1.0 B)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


