[jira] [Commented] (SPARK-46097) Push down limit 1 through Union and Aggregate

2023-11-24 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789619#comment-17789619
 ] 

Yuming Wang commented on SPARK-46097:
-

https://github.com/apache/spark/pull/44009
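
For context, a hedged sketch of the kind of rewrite the title describes, using the Union case only (illustrative SQL in a spark-shell session; the table names are assumptions, and the exact conditions for the Aggregate case are in the linked PR):

{code:scala}
// Before: the full union is evaluated before a single row is kept.
spark.sql("SELECT * FROM (SELECT a FROM t1 UNION ALL SELECT a FROM t2) LIMIT 1")

// After pushing LIMIT 1 below the Union, each child only has to produce one row:
spark.sql("""
  SELECT * FROM (
    (SELECT a FROM t1 LIMIT 1) UNION ALL (SELECT a FROM t2 LIMIT 1)
  ) LIMIT 1
""")
{code}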

> Push down limit 1 through Union and Aggregate
> 
>
> Key: SPARK-46097
> URL: https://issues.apache.org/jira/browse/SPARK-46097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46097) Push down limit 1 through Union and Aggregate

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46097:
---
Labels: pull-request-available  (was: )

> Push down limit 1 through Union and Aggregate
> 
>
> Key: SPARK-46097
> URL: https://issues.apache.org/jira/browse/SPARK-46097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46097) Push down limit 1 through Union and Aggregate

2023-11-24 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-46097:
---

 Summary: Push down limit 1 through Union and Aggregate
 Key: SPARK-46097
 URL: https://issues.apache.org/jira/browse/SPARK-46097
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46096) Upgrade sbt to 1.9.7

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46096:
---
Labels: pull-request-available  (was: )

> Upgrade sbt to 1.9.7
> 
>
> Key: SPARK-46096
> URL: https://issues.apache.org/jira/browse/SPARK-46096
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46096) Upgrade sbt to 1.9.7

2023-11-24 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-46096:
-

 Summary: Upgrade sbt to 1.9.7
 Key: SPARK-46096
 URL: https://issues.apache.org/jira/browse/SPARK-46096
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46095) Document REST API for Spark Standalone Cluster

2023-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46095:
--
Fix Version/s: 3.3.4

> Document REST API for Spark Standalone Cluster
> --
>
> Key: SPARK-46095
> URL: https://issues.apache.org/jira/browse/SPARK-46095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46085) Dataset.groupingSets in Scala Spark Connect client

2023-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46085:
-

Assignee: Hyukjin Kwon

> Dataset.groupingSets in Scala Spark Connect client
> --
>
> Key: SPARK-46085
> URL: https://issues.apache.org/jira/browse/SPARK-46085
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> Scala Spark Connect client for SPARK-45929



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46085) Dataset.groupingSets in Scala Spark Connect client

2023-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46085.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43995
[https://github.com/apache/spark/pull/43995]

> Dataset.groupingSets in Scala Spark Connect client
> --
>
> Key: SPARK-46085
> URL: https://issues.apache.org/jira/browse/SPARK-46085
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Scala Spark Connect client for SPARK-45929
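
A rough usage sketch of the API being mirrored in the Connect Scala client; the signature is assumed from SPARK-45929, and the data and column names are illustrative (spark-shell session with implicits in scope):

{code:scala}
import org.apache.spark.sql.functions._

// Assumed signature: groupingSets(groupingSets: Seq[Seq[Column]], cols: Column*)
val sales = Seq(("NY", "sedan", 10), ("NY", "suv", 5), ("SF", "sedan", 7))
  .toDF("city", "car_model", "quantity")

sales
  .groupingSets(Seq(Seq(col("city"), col("car_model")), Seq(col("city")), Seq.empty),
    col("city"), col("car_model"))
  .agg(sum(col("quantity")).as("total"))
  .show()
{code}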



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46095) Document REST API for Spark Standalone Cluster

2023-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46095:
--
Fix Version/s: 3.4.2
   3.5.1

> Document REST API for Spark Standalone Cluster
> --
>
> Key: SPARK-46095
> URL: https://issues.apache.org/jira/browse/SPARK-46095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46095) Document REST API for Spark Standalone Cluster

2023-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46095.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44007
[https://github.com/apache/spark/pull/44007]

> Document REST API for Spark Standalone Cluster
> --
>
> Key: SPARK-46095
> URL: https://issues.apache.org/jira/browse/SPARK-46095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46095) Document REST API for Spark Standalone Cluster

2023-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46095:
-

Assignee: Dongjoon Hyun

> Document REST API for Spark Standalone Cluster
> --
>
> Key: SPARK-46095
> URL: https://issues.apache.org/jira/browse/SPARK-46095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46066) Use the Separators API instead of the String API to construct the DefaultPrettyPrinter

2023-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46066.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43973
[https://github.com/apache/spark/pull/43973]

> Use the Separators API instead of the String API to construct the 
> DefaultPrettyPrinter
> --
>
> Key: SPARK-46066
> URL: https://issues.apache.org/jira/browse/SPARK-46066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> /**
>  * Constructor that specifies separator String to use between root values;
>  * if null, no separator is printed.
>  *
>  * Note: simply constructs a {@link SerializedString} out of parameter,
>  * calls {@link #DefaultPrettyPrinter(SerializableString)}
>  *
>  * @param rootSeparator String to use as root value separator
>  * @deprecated in 2.16. Use the Separators API instead.
>  */
> @Deprecated
> public DefaultPrettyPrinter(String rootSeparator) {
> this((rootSeparator == null) ? null : new 
> SerializedString(rootSeparator));
> } {code}
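
A hedged sketch of the replacement the ticket asks for. It assumes Jackson 2.16 exposes {{Separators.createDefaultInstance()}}, {{withRootSeparator}}, and a {{DefaultPrettyPrinter(Separators)}} constructor; the exact API used is in the linked PR:

{code:scala}
import com.fasterxml.jackson.core.util.{DefaultPrettyPrinter, Separators}

// Deprecated String-based form:
//   new DefaultPrettyPrinter("")          // root separator passed as a String
// Assumed Separators-based equivalent:
val printer = new DefaultPrettyPrinter(
  Separators.createDefaultInstance().withRootSeparator(""))
{code}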



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46066) Use the Separators API instead of the String API to construct the DefaultPrettyPrinter

2023-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46066:
-

Assignee: Yang Jie

> Use the Separators API instead of the String API to construct the 
> DefaultPrettyPrinter
> --
>
> Key: SPARK-46066
> URL: https://issues.apache.org/jira/browse/SPARK-46066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>
> {code:java}
> /**
>  * Constructor that specifies separator String to use between root values;
>  * if null, no separator is printed.
>  *
>  * Note: simply constructs a {@link SerializedString} out of parameter,
>  * calls {@link #DefaultPrettyPrinter(SerializableString)}
>  *
>  * @param rootSeparator String to use as root value separator
>  * @deprecated in 2.16. Use the Separators API instead.
>  */
> @Deprecated
> public DefaultPrettyPrinter(String rootSeparator) {
> this((rootSeparator == null) ? null : new 
> SerializedString(rootSeparator));
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45922) Multiple policies follow-up (Python)

2023-11-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45922.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43800
[https://github.com/apache/spark/pull/43800]

> Multiple policies follow-up (Python)
> 
>
> Key: SPARK-45922
> URL: https://issues.apache.org/jira/browse/SPARK-45922
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Alice Sayutina
>Assignee: Alice Sayutina
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Minor further improvements for multiple policies work



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46095) Document REST API for Spark Standalone Cluster

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46095:
---
Labels: pull-request-available  (was: )

> Document REST API for Spark Standalone Cluster
> --
>
> Key: SPARK-46095
> URL: https://issues.apache.org/jira/browse/SPARK-46095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46095) Document REST API for Spark Standalone Cluster

2023-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46095:
--
Parent Issue: SPARK-45869  (was: SPARK-44111)

> Document REST API for Spark Standalone Cluster
> --
>
> Key: SPARK-46095
> URL: https://issues.apache.org/jira/browse/SPARK-46095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46095) Document REST API for Spark Standalone Cluster

2023-11-24 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-46095:
-

 Summary: Document REST API for Spark Standalone Cluster
 Key: SPARK-46095
 URL: https://issues.apache.org/jira/browse/SPARK-46095
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44641) SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet

2023-11-24 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-44641:
---
Labels: correctness  (was: )

> SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but 
> conditions unmet
> --
>
> Key: SPARK-44641
> URL: https://issues.apache.org/jira/browse/SPARK-44641
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Szehon Ho
>Assignee: Chao Sun
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.4.2, 3.5.0
>
>
> Adding the following test case in KeyGroupedPartitionSuite demonstrates the 
> problem.
>  
> {code:java}
> test("test join key is the second partition key and a transform") {
>   val items_partitions = Array(bucket(8, "id"), days("arrive_time"))
>   createTable(items, items_schema, items_partitions)
>   sql(s"INSERT INTO testcat.ns.$items VALUES " +
> s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " +
> s"(1, 'aa', 41.0, cast('2020-01-15' as timestamp)), " +
> s"(2, 'bb', 10.0, cast('2020-01-01' as timestamp)), " +
> s"(2, 'bb', 10.5, cast('2020-01-01' as timestamp)), " +
> s"(3, 'cc', 15.5, cast('2020-02-01' as timestamp))")
>   val purchases_partitions = Array(bucket(8, "item_id"), days("time"))
>   createTable(purchases, purchases_schema, purchases_partitions)
>   sql(s"INSERT INTO testcat.ns.$purchases VALUES " +
> s"(1, 42.0, cast('2020-01-01' as timestamp)), " +
> s"(1, 44.0, cast('2020-01-15' as timestamp)), " +
> s"(1, 45.0, cast('2020-01-15' as timestamp)), " +
> s"(2, 11.0, cast('2020-01-01' as timestamp)), " +
> s"(3, 19.5, cast('2020-02-01' as timestamp))")
>   withSQLConf(
> SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION.key -> "false",
> SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true",
> SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key ->
>   "true") {
> val df = sql("SELECT id, name, i.price as purchase_price, " +
>   "p.item_id, p.price as sale_price " +
>   s"FROM testcat.ns.$items i JOIN testcat.ns.$purchases p " +
>   "ON i.arrive_time = p.time " +
>   "ORDER BY id, purchase_price, p.item_id, sale_price")
> val shuffles = collectShuffles(df.queryExecution.executedPlan)
> assert(!shuffles.isEmpty, "should not perform SPJ as not all join keys 
> are partition keys")
> checkAnswer(df,
>   Seq(
> Row(1, "aa", 40.0, 1, 42.0),
> Row(1, "aa", 40.0, 2, 11.0),
> Row(1, "aa", 41.0, 1, 44.0),
> Row(1, "aa", 41.0, 1, 45.0),
> Row(2, "bb", 10.0, 1, 42.0),
> Row(2, "bb", 10.0, 2, 11.0),
> Row(2, "bb", 10.5, 1, 42.0),
> Row(2, "bb", 10.5, 2, 11.0),
> Row(3, "cc", 15.5, 3, 19.5)
>   )
> )
>   }
> }{code}
>  
> Note: this test sets up the DataSource V2 to return multiple splits for the same partition.
> In this case, SPJ is not triggered (because the join key does not match the partition key), but the following code in the DSv2 scan:
> [https://github.com/apache/spark/blob/v3.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L194]
> intended to fill the empty partitions for the pushed-down partition values, will still iterate through the non-grouped partitions and look up the grouped partitions to fill the map, resulting in some duplicate input data being fed into the join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42134) Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes

2023-11-24 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-42134:
---
Labels: correctness  (was: )

> Fix getPartitionFiltersAndDataFilters() to handle filters without referenced 
> attributes
> ---
>
> Key: SPARK-42134
> URL: https://issues.apache.org/jira/browse/SPARK-42134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>  Labels: correctness
> Fix For: 3.3.2, 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43760) Incorrect attribute nullability after RewriteCorrelatedScalarSubquery leads to incorrect query results

2023-11-24 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-43760:
---
Labels: correctness  (was: )

> Incorrect attribute nullability after RewriteCorrelatedScalarSubquery leads 
> to incorrect query results
> --
>
> Key: SPARK-43760
> URL: https://issues.apache.org/jira/browse/SPARK-43760
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Andrey Gubichev
>Assignee: Andrey Gubichev
>Priority: Major
>  Labels: correctness
> Fix For: 3.4.1, 3.5.0
>
>
> The following query:
>  
> {code:java}
> select * from (
>  select t1.id c1, (
>   select t2.id c from range (1, 2) t2
>   where t1.id = t2.id  ) c2
>  from range (1, 3) t1 ) t
> where t.c2 is not null
> -- !query schema
> struct
> -- !query output
> 1 1
> 2 NULL
>  {code}
>  
> should return 1 row, because the second row is supposed to be removed by the IsNotNull predicate. However, due to incorrect nullability propagation after subquery decorrelation, the output of the subquery is (incorrectly) declared non-nullable, so the predicate is constant-folded to true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator

2023-11-24 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-44448:
---
Labels: correctness  (was: )

> Wrong results for dense_rank() <= k from InferWindowGroupLimit and 
> DenseRankLimitIterator
> -
>
> Key: SPARK-44448
> URL: https://issues.apache.org/jira/browse/SPARK-44448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
>  Labels: correctness
> Fix For: 3.5.0
>
>
> Top-k filters on a dense_rank() window function return wrong results, due to 
> a bug in optimization InferWindowGroupLimit, specifically in the code for 
> DenseRankLimitIterator, introduced in 
> https://issues.apache.org/jira/browse/SPARK-37099.
> Repro:
> {code:java}
> create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 
> 1), (2, 1), (2, 2);
> select * from (select *, dense_rank() over (partition by p order by o) as rnk 
> from t1) where rnk = 1;{code}
> Spark result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]{code}
> Correct result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]
> [2,1,1]{code}
>  
> The bug is in {{DenseRankLimitIterator}}: it fails to reset its state properly when transitioning from one window partition to the next. {{reset}} only resets {{rank = 0}}; what it is missing is also resetting {{currentRankRow = null}}. As a result, when processing the second and later window partitions, the rank incorrectly gets incremented based on comparing the ordering of the last row of the previous partition with the first row of the new partition.
> This means that a dense_rank window function with more than one window partition and more than one row with dense_rank = 1 in the second or later partitions can give wrong results when optimized.
> ({{RankLimitIterator}} narrowly avoids this bug by happenstance: the first row in the new partition will try to increment rank, but increments it by the value of count, which is 0, so it happens to work by accident.)
> Unfortunately, tests for the optimization only had a single row per rank, so they did not catch the bug, as it requires multiple rows per rank.
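
To make the missing reset concrete, here is a minimal, simplified sketch of the per-partition state such an iterator must clear; names and types are illustrative and do not match Spark's actual WindowGroupLimit code:

{code:scala}
// Simplified model of a dense-rank limiter; illustrative only.
class DenseRankLimitState[T](limit: Int, sameOrderValue: (T, T) => Boolean) {
  private var rank = 0
  private var currentRankRow: Option[T] = None

  // Returns true while the row's dense rank is within the limit.
  def accept(row: T): Boolean = {
    if (currentRankRow.forall(prev => !sameOrderValue(prev, row))) {
      rank += 1                  // new ordering value => next dense rank
      currentRankRow = Some(row)
    }
    rank <= limit
  }

  // Must be called when moving to the next window partition.
  def reset(): Unit = {
    rank = 0
    currentRankRow = None        // the missing piece described above: without clearing this,
                                 // the first row of a new partition is compared against the
                                 // last row of the previous partition
  }
}
{code}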



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45920) group by ordinal should be idempotent

2023-11-24 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-45920:
---
Labels: correctness pull-request-available  (was: pull-request-available)

> group by ordinal should be idempotent
> -
>
> Key: SPARK-45920
> URL: https://issues.apache.org/jira/browse/SPARK-45920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness, pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>
>
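
For readers unfamiliar with the term, "group by ordinal" means grouping by the position of a select-list expression; the ticket asks that the analyzer substitution for it be safe to apply more than once. A small illustrative query (table and column names are assumptions for the example, run in a spark-shell session):

{code:scala}
// GROUP BY 1 refers to the first select-list item (dept), not the literal value 1.
spark.sql("SELECT dept, count(*) AS cnt FROM employees GROUP BY 1")
{code}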




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45507) Correctness bug in correlated scalar subqueries with COUNT aggregates

2023-11-24 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-45507:
---
Labels: correctness pull-request-available  (was: pull-request-available)

> Correctness bug in correlated scalar subqueries with COUNT aggregates
> -
>
> Key: SPARK-45507
> URL: https://issues.apache.org/jira/browse/SPARK-45507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Andy Lam
>Assignee: Andy Lam
>Priority: Major
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
>  
> create view if not exists t1(a1, a2) as values (0, 1), (1, 2);
> create view if not exists t2(b1, b2) as values (0, 2), (0, 3);
> create view if not exists t3(c1, c2) as values (0, 2), (0, 3);
> -- Example 1
> select (
>   select SUM(l.cnt + r.cnt)
>   from (select count(*) cnt from t2 where t1.a1 = t2.b1 having cnt = 0) l
>   join (select count(*) cnt from t3 where t1.a1 = t3.c1 having cnt = 0) r
>   on l.cnt = r.cnt
> ) from t1
> -- Correct answer: (null, 0)
> +----------------------+
> |scalarsubquery(c1, c1)|
> +----------------------+
> |null                  |
> |null                  |
> +----------------------+
> -- Example 2
> select ( select sum(cnt) from (select count(*) cnt from t2 where t1.c1 = 
> t2.c1) ) from t1
> -- Correct answer: (2, 0)
> +------------------+
> |scalarsubquery(c1)|
> +------------------+
> |2                 |
> |null              |
> +------------------+
> -- Example 3
> select ( select count(*) from (select count(*) cnt from t2 where t1.c1 = 
> t2.c1) ) from t1
> -- Correct answer: (1, 1)
> +------------------+
> |scalarsubquery(c1)|
> +------------------+
> |1                 |
> |0                 |
> +------------------+ {code}
>  
>  
> DB fiddle for correctness 
> check:[https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/10403#]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46092) Overflow in Parquet row group filter creation causes incorrect results

2023-11-24 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-46092:
---
Labels: correctness pull-request-available  (was: pull-request-available)

> Overflow in Parquet row group filter creation causes incorrect results
> --
>
> Key: SPARK-46092
> URL: https://issues.apache.org/jira/browse/SPARK-46092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Johan Lasperas
>Priority: Major
>  Labels: correctness, pull-request-available
>
> While the parquet readers don't support reading parquet values into larger 
> Spark types, it's possible to trigger an overflow when creating a Parquet row 
> group filter that will then incorrectly skip row groups and bypass the 
> exception in the reader,
> Repro:
> {code:java}
> Seq(0).toDF("a").write.parquet(path)
> spark.read.schema("a LONG").parquet(path).where(s"a < 
> ${Long.MaxValue}").collect(){code}
> This succeeds and returns no results. This should either fail if the Parquet 
> reader doesn't support the upcast from int to long or produce result `[0]` if 
> it does.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45386) Correctness issue when persisting using StorageLevel.NONE

2023-11-24 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-45386:
---
Labels: correctness pull-request-available  (was: pull-request-available)

> Correctness issue when persisting using StorageLevel.NONE
> -
>
> Key: SPARK-45386
> URL: https://issues.apache.org/jira/browse/SPARK-45386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Emil Ejbyfeldt
>Assignee: Emil Ejbyfeldt
>Priority: Major
>  Labels: correctness, pull-request-available
> Fix For: 3.5.1
>
>
> When using spark 3.5.0 this code
> {code:java}
> import org.apache.spark.storage.StorageLevel
> spark.createDataset(Seq(1,2,3)).persist(StorageLevel.NONE).count() {code}
> incorrectly returns 0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-11-24 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-44871:
---
Labels: correctness  (was: )

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Critical
>  Labels: correctness
> Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43393) Sequence expression can overflow

2023-11-24 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-43393:
---
Labels: correctness pull-request-available  (was: pull-request-available)

> Sequence expression can overflow
> 
>
> Key: SPARK-43393
> URL: https://issues.apache.org/jira/browse/SPARK-43393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Deepayan Patra
>Assignee: Deepayan Patra
>Priority: Major
>  Labels: correctness, pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>
>
> Spark has a (long-standing) overflow bug in the {{sequence}} expression.
>  
> Consider the following operations:
> {{spark.sql("CREATE TABLE foo (l LONG);")}}
> {{spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});")}}
> {{spark.sql("SELECT sequence(0, l) FROM foo;").collect()}}
>  
> The result of these operations will be:
> {{Array[org.apache.spark.sql.Row] = Array([WrappedArray()])}}
> an unintended consequence of overflow.
>  
> The sequence is applied to values {{0}} and {{Long.MaxValue}} with a step 
> size of {{1}} which uses a length computation defined 
> [here|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451].
>  In this calculation, with {{start = 0}}, {{stop = Long.MaxValue}}, and {{step = 1}}, the calculated {{len}} overflows to {{Long.MinValue}}. The computation, in binary (64-bit values abbreviated), looks like:
> 0111...1 (Long.MaxValue) - 0000...0 (start) = 0111...1
> 0111...1 / 0000...1 (step) = 0111...1
> 0111...1 + 0000...1 = 1000...0 (Long.MinValue)
> The following [check|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454] passes, as the negative {{Long.MinValue}} is still {{<= MAX_ROUNDED_ARRAY_LENGTH}}. The following cast to {{toInt}} uses this representation and [truncates the upper bits|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457], resulting in an empty length of 0.
> Other overflows are similarly problematic.
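
The overflow can be illustrated with plain Long arithmetic; a simplified version of the length computation described above (the real code is at the linked line):

{code:scala}
val start = 0L
val stop  = Long.MaxValue
val step  = 1L

val len = (stop - start) / step + 1  // overflows to Long.MinValue (-9223372036854775808)
val truncated = len.toInt            // upper 32 bits dropped => 0, hence the empty sequence
{code}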



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43240) df.describe() method may return wrong result if the last RDD is RDD[UnsafeRow]

2023-11-24 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-43240:
---
Labels: correctness  (was: )

> df.describe() method may return wrong result if the last RDD is 
> RDD[UnsafeRow]
> ---
>
> Key: SPARK-43240
> URL: https://issues.apache.org/jira/browse/SPARK-43240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
>  Labels: correctness
> Fix For: 3.3.3
>
>
> When calling the df.describe() method, the result may be wrong when the last RDD is RDD[UnsafeRow]. This is because the UnsafeRow will be released after the row is used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43098) Should not handle the COUNT bug when the GROUP BY clause of a correlated scalar subquery is non-empty

2023-11-24 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-43098:
---
Labels: correctness  (was: )

> Should not handle the COUNT bug when the GROUP BY clause of a correlated 
> scalar subquery is non-empty
> -
>
> Key: SPARK-43098
> URL: https://issues.apache.org/jira/browse/SPARK-43098
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
>  Labels: correctness
> Fix For: 3.4.1, 3.5.0
>
>
> From [~allisonwang-db] :
> There is no COUNT bug when the correlated equality predicates are also in the 
> group by clause. However, the current logic to handle the COUNT bug still 
> adds default aggregate function value and returns incorrect results.
>  
> {code:java}
> create view t1(c1, c2) as values (0, 1), (1, 2);
> create view t2(c1, c2) as values (0, 2), (0, 3);
> select c1, c2, (select count(*) from t2 where t1.c1 = t2.c1 group by c1) from 
> t1;
> -- Correct answer: [(0, 1, 2), (1, 2, null)]
> +---+---+------------------+
> |c1 |c2 |scalarsubquery(c1)|
> +---+---+------------------+
> |0  |1  |2                 |
> |1  |2  |0                 |
> +---+---+------------------+
>  {code}
>  
> This bug affects scalar subqueries in RewriteCorrelatedScalarSubquery, but 
> lateral subqueries handle it correctly in DecorrelateInnerQuery. Related: 
> https://issues.apache.org/jira/browse/SPARK-36113 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46094) Add support for code profiling executors

2023-11-24 Thread Parth Chandra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Chandra updated SPARK-46094:
--
Component/s: Connect Contrib
 (was: Spark Core)

> Add support for code profiling executors
> 
>
> Key: SPARK-46094
> URL: https://issues.apache.org/jira/browse/SPARK-46094
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect Contrib
>Affects Versions: 4.0.0
>Reporter: Parth Chandra
>Priority: Major
>
> To profile a Spark application, a user or developer has to run a Spark job locally on the development machine and use a tool like Java Flight Recorder, YourKit, or async-profiler to record profiling information. Because profiling can be expensive, the profiler is typically attached to the Spark JVM process after the process has started, and stopped once sufficient profiling data is collected.
> The developer's environment is frequently different from the production environment and may not yield accurate information.
> However, the profiling process is hard when a Spark application runs as a distributed job on a cluster where the developer may have limited access to the actual nodes where the executor processes are running. Also, in environments like Kubernetes, where the executor pods may be removed as soon as the job completes, retrieving the profiling information from each executor pod can become quite tricky.
> This feature is to add a low-overhead sampling profiler, such as async-profiler, as a built-in capability of the Spark job that can be turned on using only user-configurable parameters (async-profiler is a low-overhead profiler that can be invoked programmatically and is available as a single multi-platform jar, for Linux and macOS).
> In addition, for convenience, the feature would save profiling output files to the distributed file system so that information from all executors is available in a single place.
> The feature would add an executor plugin that does not add any overhead unless enabled and can be configured to accept profiler arguments as a configuration parameter.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46094) Add support for code profiling executors

2023-11-24 Thread Parth Chandra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Chandra updated SPARK-46094:
--
Component/s: Connect
 (was: Connect Contrib)

> Add support for code profiling executors
> 
>
> Key: SPARK-46094
> URL: https://issues.apache.org/jira/browse/SPARK-46094
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Parth Chandra
>Priority: Major
>
> To profile a Spark application, a user or developer has to run a Spark job locally on the development machine and use a tool like Java Flight Recorder, YourKit, or async-profiler to record profiling information. Because profiling can be expensive, the profiler is typically attached to the Spark JVM process after the process has started, and stopped once sufficient profiling data is collected.
> The developer's environment is frequently different from the production environment and may not yield accurate information.
> However, the profiling process is hard when a Spark application runs as a distributed job on a cluster where the developer may have limited access to the actual nodes where the executor processes are running. Also, in environments like Kubernetes, where the executor pods may be removed as soon as the job completes, retrieving the profiling information from each executor pod can become quite tricky.
> This feature is to add a low-overhead sampling profiler, such as async-profiler, as a built-in capability of the Spark job that can be turned on using only user-configurable parameters (async-profiler is a low-overhead profiler that can be invoked programmatically and is available as a single multi-platform jar, for Linux and macOS).
> In addition, for convenience, the feature would save profiling output files to the distributed file system so that information from all executors is available in a single place.
> The feature would add an executor plugin that does not add any overhead unless enabled and can be configured to accept profiler arguments as a configuration parameter.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46094) Add support for code profiling executors

2023-11-24 Thread Parth Chandra (Jira)
Parth Chandra created SPARK-46094:
-

 Summary: Add support for code profiling executors
 Key: SPARK-46094
 URL: https://issues.apache.org/jira/browse/SPARK-46094
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Parth Chandra


To profile a Spark application, a user or developer has to run a Spark job locally on the development machine and use a tool like Java Flight Recorder, YourKit, or async-profiler to record profiling information. Because profiling can be expensive, the profiler is typically attached to the Spark JVM process after the process has started, and stopped once sufficient profiling data is collected.

The developer's environment is frequently different from the production environment and may not yield accurate information.

However, the profiling process is hard when a Spark application runs as a distributed job on a cluster where the developer may have limited access to the actual nodes where the executor processes are running. Also, in environments like Kubernetes, where the executor pods may be removed as soon as the job completes, retrieving the profiling information from each executor pod can become quite tricky.

This feature is to add a low-overhead sampling profiler, such as async-profiler, as a built-in capability of the Spark job that can be turned on using only user-configurable parameters (async-profiler is a low-overhead profiler that can be invoked programmatically and is available as a single multi-platform jar, for Linux and macOS).

In addition, for convenience, the feature would save profiling output files to the distributed file system so that information from all executors is available in a single place.

The feature would add an executor plugin that does not add any overhead unless enabled and can be configured to accept profiler arguments as a configuration parameter.
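
A purely hypothetical sketch of how such a plugin might be switched on through Spark's existing {{spark.plugins}} mechanism; the plugin class name and the profiler-specific keys below are made-up placeholders, not configs proposed by this ticket:

{code:scala}
import org.apache.spark.sql.SparkSession

// Only spark.plugins is an existing Spark config; everything org.example.* is hypothetical.
val spark = SparkSession.builder()
  .config("spark.plugins", "org.example.profiling.ExecutorProfilerPlugin")  // hypothetical class
  .config("spark.example.profiler.args", "event=wall,interval=10ms")        // hypothetical key
  .config("spark.example.profiler.dfsDir", "hdfs:///spark-profiles")        // hypothetical key
  .getOrCreate()
{code}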



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46092) Overflow in Parquet row group filter creation causes incorrect results

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46092:
---
Labels: pull-request-available  (was: )

> Overflow in Parquet row group filter creation causes incorrect results
> --
>
> Key: SPARK-46092
> URL: https://issues.apache.org/jira/browse/SPARK-46092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Johan Lasperas
>Priority: Major
>  Labels: pull-request-available
>
> While the parquet readers don't support reading parquet values into larger 
> Spark types, it's possible to trigger an overflow when creating a Parquet row 
> group filter that will then incorrectly skip row groups and bypass the 
> exception in the reader,
> Repro:
> {code:java}
> Seq(0).toDF("a").write.parquet(path)
> spark.read.schema("a LONG").parquet(path).where(s"a < 
> ${Long.MaxValue}").collect(){code}
> This succeeds and returns no results. This should either fail if the Parquet 
> reader doesn't support the upcast from int to long or produce result `[0]` if 
> it does.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46093) append to parquet file with column type changed corrupts file

2023-11-24 Thread richard gooding (Jira)
richard gooding created SPARK-46093:
---

 Summary: append to parquet file with column type changed corrupts file
 Key: SPARK-46093
 URL: https://issues.apache.org/jira/browse/SPARK-46093
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 3.3.0
Reporter: richard gooding


from pyspark.sql.functions import *
from pyspark.sql.types import *

fnBad = "dbfs:/tmp/richard.good...@os.uk/test_bad_parquet/f1"
df = spark.createDataFrame([[""]]).select(col("_1").alias("aa"))
df.printSchema()

fmt = "parquet"
# fmt = "delta"
df.write.mode("overwrite").format(fmt).save(fnBad)
df.show()

df = df.withColumn("aa", struct(col("aa")))  # change type of column - error on load
df.printSchema()
df.show()
# With format = delta this append fails fast with: "AnalysisException: Failed to merge fields
# 'aa' and 'aa'. Failed to merge incompatible data types StringType and
# StructType(StructField(aa,StringType,true))"
df.write.mode("append").format(fmt).save(fnBad)
# df.write.mode("append").option("mergeSchema", "true").format(fmt).save(fnBad)  # gives a different error, but only when the dataframe is read

print(" --- at df 2 --- ")
df2 = spark.read.format(fmt).load(fnBad)
# df2 = spark.read.option("mergeSchema", "true").format(fmt).load(fnBad)
df2.show()  # this will error



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46092) Overflow in Parquet row group filter creation causes incorrect results

2023-11-24 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-46092:
---
Description: 
While the parquet readers don't support reading parquet values into larger 
Spark types, it's possible to trigger an overflow when creating a Parquet row 
group filter that will then incorrectly skip row groups and bypass the 
exception in the reader,

Repro:
{code:java}
Seq(0).toDF("a").write.parquet(path)
spark.read.schema("a LONG").parquet(path).where(s"a < 
${Long.MaxValue}").collect(){code}
This succeeds and returns no results. This should either fail if the Parquet 
reader doesn't support the upcast from int to long or produce result `[0]` if 
it does.

  was:
While the parquet readers don't support reading parquet values into larger 
Spark types, it's possible to trigger an overflow when creating a Parquet row 
group filter that will then incorrectly skip row groups and bypass the 
exception in the reader,

Repro:

```

Seq(0).toDF("a").write.parquet(path)
spark.read.schema("a LONG").parquet(path).where(s"a < 
${Long.MaxValue}").collect()

```

This succeeds and returns no results. This should either fail if the Parquet 
reader doesn't support the upcast from int to long or produce result `[0]` if 
it does.


> Overflow in Parquet row group filter creation causes incorrect results
> --
>
> Key: SPARK-46092
> URL: https://issues.apache.org/jira/browse/SPARK-46092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Johan Lasperas
>Priority: Major
>
> While the parquet readers don't support reading parquet values into larger 
> Spark types, it's possible to trigger an overflow when creating a Parquet row 
> group filter that will then incorrectly skip row groups and bypass the 
> exception in the reader,
> Repro:
> {code:java}
> Seq(0).toDF("a").write.parquet(path)
> spark.read.schema("a LONG").parquet(path).where(s"a < 
> ${Long.MaxValue}").collect(){code}
> This succeeds and returns no results. This should either fail if the Parquet 
> reader doesn't support the upcast from int to long or produce result `[0]` if 
> it does.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46092) Overflow in Parquet row group filter creation causes incorrect results

2023-11-24 Thread Johan Lasperas (Jira)
Johan Lasperas created SPARK-46092:
--

 Summary: Overflow in Parquet row group filter creation causes 
incorrect results
 Key: SPARK-46092
 URL: https://issues.apache.org/jira/browse/SPARK-46092
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Johan Lasperas


While the parquet readers don't support reading parquet values into larger 
Spark types, it's possible to trigger an overflow when creating a Parquet row 
group filter that will then incorrectly skip row groups and bypass the 
exception in the reader,

Repro:

```

Seq(0).toDF("a").write.parquet(path)
spark.read.schema("a LONG").parquet(path).where(s"a < 
${Long.MaxValue}").collect()

```

This succeeds and returns no results. This should either fail if the Parquet 
reader doesn't support the upcast from int to long or produce result `[0]` if 
it does.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46091) [KUBERNETES] Respect the existing kubernetes container SPARK_LOCAL_DIRS env

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46091:
---
Labels: pull-request-available  (was: )

> [KUBERNETES] Respect the existing kubernetes container SPARK_LOCAL_DIRS env
> ---
>
> Key: SPARK-46091
> URL: https://issues.apache.org/jira/browse/SPARK-46091
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Fei Wang
>Priority: Major
>  Labels: pull-request-available
>
> Respect the user-defined SPARK_LOCAL_DIRS container env when setting up local dirs.
> For example, we use hostPath for the Spark local dir, but we do not mount the sub-disks directly into the pod; we mount a root path for the Spark driver/executor pod.
> For example, the root path is `/hadoop`, and there are sub-disks under it, like `/hadoop/1, /hadoop/2, /hadoop/3, /hadoop/4`.
> We want to define SPARK_LOCAL_DIRS in the driver/executor pod env, but currently the user-specified SPARK_LOCAL_DIRS does not work.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46036) raise_error should only take the error class parameter for internal usage

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46036:
---
Labels: pull-request-available  (was: )

> raise_error should only take the error class parameter for internal usage
> -
>
> Key: SPARK-46036
> URL: https://issues.apache.org/jira/browse/SPARK-46036
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>
> In [https://github.com/apache/spark/pull/42985] , we extended the 
> `raise_error` function to take an extra error-class parameter. However, this 
> is too powerful as users may use it to throw special internal errors that we 
> don't expect them to throw.
> It's useful to have this `raise_error` extension for internal usage. We should create an `ExpressionBuilder` for the `raise_error` function, so that we only allow end users to pass the error message, not the error class.
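
For reference, the end-user-facing form that should remain available (run in a spark-shell session); the error-class variant added by the linked PR is the part the ticket wants to keep internal:

{code:scala}
// raise_error with just a message stays available to end users; the error fires
// when the expression is actually evaluated (e.g. on collect()).
spark.sql("SELECT raise_error('something went wrong')")
{code}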



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46091) [KUBERNETES] Respect the existing kubernetes container SPARK_LOCAL_DIRS env

2023-11-24 Thread Fei Wang (Jira)
Fei Wang created SPARK-46091:


 Summary: [KUBERNETES] Respect the existing kubernetes container 
SPARK_LOCAL_DIRS env
 Key: SPARK-46091
 URL: https://issues.apache.org/jira/browse/SPARK-46091
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.5.0
Reporter: Fei Wang


Respect the user-defined SPARK_LOCAL_DIRS container env when setting up local dirs.

For example, we use hostPath for the Spark local dir, but we do not mount the sub-disks directly into the pod; we mount a root path for the Spark driver/executor pod.

For example, the root path is `/hadoop`, and there are sub-disks under it, like `/hadoop/1, /hadoop/2, /hadoop/3, /hadoop/4`.

We want to define SPARK_LOCAL_DIRS in the driver/executor pod env, but currently the user-specified SPARK_LOCAL_DIRS does not work.
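
A sketch of the setup described above, using existing configs to pass the env to the pods (paths are the example's own; the ticket reports that the resulting env is currently ignored when Spark sets up its local dirs):

{code:scala}
// Illustrative: ask the Kubernetes driver/executor pods to spread local dirs over
// the sub-disks mounted under the pod's /hadoop root, via the container env.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.executorEnv.SPARK_LOCAL_DIRS", "/hadoop/1,/hadoop/2,/hadoop/3,/hadoop/4")
  .config("spark.kubernetes.driverEnv.SPARK_LOCAL_DIRS", "/hadoop/1,/hadoop/2,/hadoop/3,/hadoop/4")
  .getOrCreate()
{code}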

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46077) Error in postgresql when pushing down filter by timestamp_ntz field

2023-11-24 Thread Marina Krasilnikova (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marina Krasilnikova updated SPARK-46077:

Description: 
code to reproduce:

SparkSession sparkSession = SparkSession
    .builder()
    .appName("test-app")
    .master("local[*]")
    .config("spark.sql.timestampType", "TIMESTAMP_NTZ")
    .getOrCreate();

String url = "...";

String catalogPropPrefix = "spark.sql.catalog.myc";
sparkSession.conf().set(catalogPropPrefix, JDBCTableCatalog.class.getName());
sparkSession.conf().set(catalogPropPrefix + ".url", url);

Map<String, String> options = new HashMap<>();
options.put("driver", "org.postgresql.Driver");
// options.put("pushDownPredicate", "false");  // it works fine if this line is uncommented

Dataset<Row> dataset = sparkSession.read()
    .options(options)
    .table("myc.demo.`My table`");

dataset.createOrReplaceTempView("view1");
String sql = "select * from view1 where `my date` = '2021-04-01 00:00:00'";
Dataset<Row> result = sparkSession.sql(sql);
result.show();
result.printSchema();

Field `my date` is of type timestamp. This code results in an org.postgresql.util.PSQLException (syntax error).

String sql = "select * from view1 where `my date` = to_timestamp('2021-04-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')";  // this query also doesn't work

String sql = "select * from view1 where `my date` = date_trunc('DAY', to_timestamp('2021-04-01 00:00:00', 'yyyy-MM-dd HH:mm:ss'))";  // but this is OK

Is it a bug, or did I get something wrong?

  was:
code to reproduce:

SparkSession sparkSession = SparkSession
.builder()
.appName("test-app")
.master("local[*]")
.config("spark.sql.timestampType", "TIMESTAMP_NTZ")
.getOrCreate();

String url = "...";

String catalogPropPrefix = "spark.sql.catalog.myc";
sparkSession.conf().set(catalogPropPrefix, JDBCTableCatalog.class.getName());
sparkSession.conf().set(catalogPropPrefix + ".url", url);

Map<String, String> options = new HashMap<>();
options.put("driver", "org.postgresql.Driver");
// options.put("pushDownPredicate", "false");  it works fine if this line is uncommented

Dataset<Row> dataset = sparkSession.read()
.options(options)
.table("myc.demo.`My table`");

dataset.createOrReplaceTempView("view1");
String sql = "select * from view1 where `my date` = '2021-04-01 00:00:00'";
Dataset<Row> result = sparkSession.sql(sql);
result.show();
result.printSchema();

Field `my date` is of type timestamp. This code results in an
org.postgresql.util.PSQLException (syntax error), because the resulting SQL
lacks quotes around the timestamp literal in the filter condition (something
like "my date" = 2021-04-01T00:00).
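For illustration only (the exact generated statement may differ), the pushed-down filter reaching PostgreSQL looks roughly like the first statement below, while the second is the quoted form PostgreSQL would accept:

{code:sql}
-- What the JDBC pushdown apparently sends (unquoted timestamp literal, rejected):
SELECT * FROM "demo"."My table" WHERE "my date" = 2021-04-01T00:00

-- What PostgreSQL would accept (quoted literal):
SELECT * FROM "demo"."My table" WHERE "my date" = '2021-04-01 00:00:00'
{code}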

 


> Error in postgresql when pushing down filter by timestamp_ntz field
> ---
>
> Key: SPARK-46077
> URL: https://issues.apache.org/jira/browse/SPARK-46077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Marina Krasilnikova
>Priority: Minor
>
> code to reproduce:
> SparkSession sparkSession = SparkSession
> .builder()
> .appName("test-app")
> .master("local[*]")
> .config("spark.sql.timestampType", "TIMESTAMP_NTZ")
> .getOrCreate();
> String url = "...";
> String catalogPropPrefix = "spark.sql.catalog.myc";
> sparkSession.conf().set(catalogPropPrefix, JDBCTableCatalog.class.getName());
> sparkSession.conf().set(catalogPropPrefix + ".url", url);
> Map options = new HashMap<>();
> options.put("driver", "org.postgresql.Driver");
> // options.put("pushDownPredicate", "false");  it works fine if  this line is 
> uncommented
> Dataset dataset = sparkSession.read()
> .options(options)
> .table("myc.demo.`My table`");
> dataset.createOrReplaceTempView("view1");
> String sql = "select * from view1 where `my date` = '2021-04-01 00:00:00'";
> Dataset result = sparkSession.sql(sql);
> result.show();
> result.printSchema();
> Field `my date` is of type timestamp. This code results in 
> org.postgresql.util.PSQLException  syntax error
>  
>  
> String sql = "select * from view1 where `my date` = to_timestamp('2021-04-01 
> 00:00:00', '-MM-dd HH:mm:ss')";  // this query also doesn't work
> String sql = "select * from view1 where `my date` = date_trunc('DAY', 
> to_timestamp('2021-04-01 00:00:00', '-MM-dd HH:mm:ss'))";  // but this is 
> OK
>  
> Is it a bug or I got something wrong?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46077) Error in postgresql when pushing down filter by timestamp_ntz field

2023-11-24 Thread Marina Krasilnikova (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marina Krasilnikova updated SPARK-46077:

Priority: Minor  (was: Major)

> Error in postgresql when pushing down filter by timestamp_ntz field
> ---
>
> Key: SPARK-46077
> URL: https://issues.apache.org/jira/browse/SPARK-46077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Marina Krasilnikova
>Priority: Minor
>
> code to reproduce:
> SparkSession sparkSession = SparkSession
> .builder()
> .appName("test-app")
> .master("local[*]")
> .config("spark.sql.timestampType", "TIMESTAMP_NTZ")
> .getOrCreate();
> String url = "...";
> String catalogPropPrefix = "spark.sql.catalog.myc";
> sparkSession.conf().set(catalogPropPrefix, JDBCTableCatalog.class.getName());
> sparkSession.conf().set(catalogPropPrefix + ".url", url);
> Map options = new HashMap<>();
> options.put("driver", "org.postgresql.Driver");
> // options.put("pushDownPredicate", "false");  it works fine if  this line is 
> uncommented
> Dataset dataset = sparkSession.read()
> .options(options)
> .table("myc.demo.`My table`");
> dataset.createOrReplaceTempView("view1");
> String sql = "select * from view1 where `my date` = '2021-04-01 00:00:00'";
> Dataset result = sparkSession.sql(sql);
> result.show();
> result.printSchema();
>  
> Field `my date` is of type timestamp. This code results in 
> org.postgresql.util.PSQLException  syntax error , because resulting sql  
> lacks straight quotes in filter condition. (Something like this  "my date" = 
> 2021-04-01T00:00)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46016) Fix pandas API support list properly

2023-11-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-46016.
--
Fix Version/s: 3.4.2
   4.0.0
   3.5.1
 Assignee: Haejoon Lee
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/43996

> Fix pandas API support list properly
> 
>
> Key: SPARK-46016
> URL: https://issues.apache.org/jira/browse/SPARK-46016
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>
> Currently the supported pandas API list is not generated properly, so we should fix it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46090) Support stage level SQL configs

2023-11-24 Thread XiDuo You (Jira)
XiDuo You created SPARK-46090:
-

 Summary: Support stage level SQL configs
 Key: SPARK-46090
 URL: https://issues.apache.org/jira/browse/SPARK-46090
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You


AQE executes the query plan stage by stage, so there is an opportunity to support 
stage-level SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46087) Sync PySpark dependencies in docs and dev requirements

2023-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46087.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44000
[https://github.com/apache/spark/pull/44000]

> Sync PySpark dependencies in docs and dev requirements
> --
>
> Key: SPARK-46087
> URL: https://issues.apache.org/jira/browse/SPARK-46087
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> There is inconsistency between docs and dev env. We should sync them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46087) Sync PySpark dependencies in docs and dev requirements

2023-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46087:
-

Assignee: Haejoon Lee

> Sync PySpark dependencies in docs and dev requirements
> --
>
> Key: SPARK-46087
> URL: https://issues.apache.org/jira/browse/SPARK-46087
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> There is inconsistency between docs and dev env. We should sync them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45959) Abusing DataSet.withColumn can cause huge tree with severe perf degradation

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45959:
--

Assignee: Apache Spark

> Abusing DataSet.withColumn can cause huge tree with severe perf degradation
> ---
>
> Key: SPARK-45959
> URL: https://issues.apache.org/jira/browse/SPARK-45959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
>
> Though the documentation clearly recommends adding all columns in a single 
> shot, in reality it is difficult to expect customers to modify their code: in 
> Spark 2 the analyzer rules did not do deep tree traversal, and moreover in 
> Spark 3 the plans are cloned before being handed to the analyzer, optimizer, 
> etc., which was not the case in Spark 2.
> All of this has increased query time from 5 minutes to 2-3 hours.
> Many times the columns are added to the plan via some for-loop logic which 
> just keeps adding new computation based on some rule.
> So my suggestion is to do an initial check in the withColumn API before 
> creating a new projection: if all the existing columns are still being 
> projected, and the new column's expression depends not on the output of the 
> top node but on its child, then instead of adding a new Project, the column 
> can be added to the existing node.
> For a start, maybe we can just handle the Project node.
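For reference, a minimal sketch of the loop anti-pattern versus the single-shot alternative the documentation recommends (assumes Spark 3.3+ where Dataset.withColumns is available, and an existing SparkSession named spark):

{code:scala}
import org.apache.spark.sql.functions.col

val df = spark.range(10).withColumnRenamed("id", "value")

// Anti-pattern: every withColumn call wraps the plan in another Project,
// so a loop like this builds a very deep tree that the analyzer re-traverses.
var looped = df
for (i <- 1 to 200) {
  looped = looped.withColumn(s"c$i", col("value") * i)
}

// Single-shot alternative: one Project node carries all derived columns.
val single = df.withColumns((1 to 200).map(i => s"c$i" -> (col("value") * i)).toMap)
{code}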



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45959) Abusing DataSet.withColumn can cause huge tree with severe perf degradation

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45959:
--

Assignee: (was: Apache Spark)

> Abusing DataSet.withColumn can cause huge tree with severe perf degradation
> ---
>
> Key: SPARK-45959
> URL: https://issues.apache.org/jira/browse/SPARK-45959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Minor
>  Labels: pull-request-available
>
> Though the documentation clearly recommends adding all columns in a single 
> shot, in reality it is difficult to expect customers to modify their code: in 
> Spark 2 the analyzer rules did not do deep tree traversal, and moreover in 
> Spark 3 the plans are cloned before being handed to the analyzer, optimizer, 
> etc., which was not the case in Spark 2.
> All of this has increased query time from 5 minutes to 2-3 hours.
> Many times the columns are added to the plan via some for-loop logic which 
> just keeps adding new computation based on some rule.
> So my suggestion is to do an initial check in the withColumn API before 
> creating a new projection: if all the existing columns are still being 
> projected, and the new column's expression depends not on the output of the 
> top node but on its child, then instead of adding a new Project, the column 
> can be added to the existing node.
> For a start, maybe we can just handle the Project node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46069) Support unwrap timestamp type to date type

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46069:
--

Assignee: (was: Apache Spark)

> Support unwrap timestamp type to date type
> --
>
> Key: SPARK-46069
> URL: https://issues.apache.org/jira/browse/SPARK-46069
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wan Kun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45959) Abusing DataSet.withColumn can cause huge tree with severe perf degradation

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45959:
--

Assignee: (was: Apache Spark)

> Abusing DataSet.withColumn can cause huge tree with severe perf degradation
> ---
>
> Key: SPARK-45959
> URL: https://issues.apache.org/jira/browse/SPARK-45959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Minor
>  Labels: pull-request-available
>
> Though the documentation clearly recommends adding all columns in a single 
> shot, in reality it is difficult to expect customers to modify their code: in 
> Spark 2 the analyzer rules did not do deep tree traversal, and moreover in 
> Spark 3 the plans are cloned before being handed to the analyzer, optimizer, 
> etc., which was not the case in Spark 2.
> All of this has increased query time from 5 minutes to 2-3 hours.
> Many times the columns are added to the plan via some for-loop logic which 
> just keeps adding new computation based on some rule.
> So my suggestion is to do an initial check in the withColumn API before 
> creating a new projection: if all the existing columns are still being 
> projected, and the new column's expression depends not on the output of the 
> top node but on its child, then instead of adding a new Project, the column 
> can be added to the existing node.
> For a start, maybe we can just handle the Project node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45959) Abusing DataSet.withColumn can cause huge tree with severe perf degradation

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45959:
--

Assignee: Apache Spark

> Abusing DataSet.withColumn can cause huge tree with severe perf degradation
> ---
>
> Key: SPARK-45959
> URL: https://issues.apache.org/jira/browse/SPARK-45959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
>
> Though the documentation clearly recommends adding all columns in a single 
> shot, in reality it is difficult to expect customers to modify their code: in 
> Spark 2 the analyzer rules did not do deep tree traversal, and moreover in 
> Spark 3 the plans are cloned before being handed to the analyzer, optimizer, 
> etc., which was not the case in Spark 2.
> All of this has increased query time from 5 minutes to 2-3 hours.
> Many times the columns are added to the plan via some for-loop logic which 
> just keeps adding new computation based on some rule.
> So my suggestion is to do an initial check in the withColumn API before 
> creating a new projection: if all the existing columns are still being 
> projected, and the new column's expression depends not on the output of the 
> top node but on its child, then instead of adding a new Project, the column 
> can be added to the existing node.
> For a start, maybe we can just handle the Project node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46069) Support unwrap timestamp type to date type

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46069:
--

Assignee: Apache Spark

> Support unwrap timestamp type to date type
> --
>
> Key: SPARK-46069
> URL: https://issues.apache.org/jira/browse/SPARK-46069
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wan Kun
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46089) Upgrade commons-lang3 to 3.14.0

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46089:
--

Assignee: Apache Spark

> Upgrade commons-lang3 to 3.14.0
> ---
>
> Key: SPARK-46089
> URL: https://issues.apache.org/jira/browse/SPARK-46089
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46074) [CONNECT][SCALA] Insufficient details in error when a UDF fails

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46074:
--

Assignee: (was: Apache Spark)

> [CONNECT][SCALA] Insufficient details in error when a UDF fails
> ---
>
> Key: SPARK-46074
> URL: https://issues.apache.org/jira/browse/SPARK-46074
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Niranjan Jayakar
>Priority: Major
>  Labels: pull-request-available
>
> Currently, when a UDF fails the connect client does not receive the actual 
> error that caused the failure. 
> As an example, the error message looks like -
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: 
> grpc_shaded.io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to 
> stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost 
> task 2.3 in stage 0.0 (TID 10) (10.68.141.158 executor 0): 
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (` (Main$$$Lambda$4770/1714264622)`: (int) => int). 
> SQLSTATE: 39000 {code}
> In this case, the actual error was a {{{}java.lang.NoClassDefFoundError{}}}.
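For illustration, a minimal sketch of the kind of failing UDF involved (hypothetical example against a Spark Connect Scala client session named spark); today the client-side exception only carries the generic FAILED_EXECUTE_UDF text, not the underlying cause:

{code:scala}
import org.apache.spark.sql.functions.{col, udf}

// The body throws on the server/executor side; without better error
// propagation, the client never sees this IllegalStateException message.
val boom = udf((i: Int) => {
  if (i == 2) throw new IllegalStateException(s"bad value: $i")
  i * 2
})

spark.range(5).select(boom(col("id").cast("int"))).collect()
{code}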



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46089) Upgrade commons-lang3 to 3.14.0

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46089:
---
Labels: pull-request-available  (was: )

> Upgrade commons-lang3 to 3.14.0
> ---
>
> Key: SPARK-46089
> URL: https://issues.apache.org/jira/browse/SPARK-46089
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46074) [CONNECT][SCALA] Insufficient details in error when a UDF fails

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46074:
--

Assignee: Apache Spark

> [CONNECT][SCALA] Insufficient details in error when a UDF fails
> ---
>
> Key: SPARK-46074
> URL: https://issues.apache.org/jira/browse/SPARK-46074
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Niranjan Jayakar
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Currently, when a UDF fails the connect client does not receive the actual 
> error that caused the failure. 
> As an example, the error message looks like -
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: 
> grpc_shaded.io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to 
> stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost 
> task 2.3 in stage 0.0 (TID 10) (10.68.141.158 executor 0): 
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (` (Main$$$Lambda$4770/1714264622)`: (int) => int). 
> SQLSTATE: 39000 {code}
> In this case, the actual error was a {{{}java.lang.NoClassDefFoundError{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46089) Upgrade commons-lang3 to 3.14.0

2023-11-24 Thread Yang Jie (Jira)
Yang Jie created SPARK-46089:


 Summary: Upgrade commons-lang3 to 3.14.0
 Key: SPARK-46089
 URL: https://issues.apache.org/jira/browse/SPARK-46089
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46056:
--

Assignee: Apache Spark

> Vectorized parquet reader throws NPE when reading files with DecimalType 
> default values
> ---
>
> Key: SPARK-46056
> URL: https://issues.apache.org/jira/browse/SPARK-46056
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Cosmin Dumitru
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> The scenario is a bit more complicated than what the title says but it's not 
> that far fetched. 
>  # Write a parquet file with one column
>  # Evolve the schema and add a new column with DecimalType wide enough that 
> it doesn't fit in a long and has a default value. 
>  # Try to read the file with the new schema
>  # NPE 
> The issue lies in how the column vector stores DecimalTypes. It incorrectly 
> assumes that they fit in a long and tries to write them to the associated 
> long array.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L724]
>  
> In OnHeapColumnVector, which extends WritableColumnVector, reserveInternal() 
> checks whether the type is too wide and initializes the array elements. 
> [https://github.com/apache/spark/blob/b568ba43f0dd80130bca1bf86c48d0d359e57883/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java#L568]
> isArray() returns true if the type is byteArrayDecimalType.
> [https://github.com/apache/spark/blob/afebf8e6c9f24d264580d084cb12e3e6af120a5a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L945]
>  
> Without the fix 
> {code:java}
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:370)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendLongs(WritableColumnVector.java:611)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendObjects(WritableColumnVector.java:745)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.(ParquetColumnVector.java:95)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:286)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:306)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:293)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:280)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
> [info]   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> [info]   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
> [info]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891){code}
> fix PR [https://github.com/apache/spark/pull/43960]
>  
>  
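A rough repro sketch of the scenario above (hypothetical table name; assumes Spark 3.4+ where DEFAULT column values are supported for the Parquet source, with the vectorized reader enabled, which is the default):

{code:scala}
// Step 1: a Parquet table with a single column, plus some existing rows.
spark.sql("CREATE TABLE t (id BIGINT) USING parquet")
spark.sql("INSERT INTO t VALUES (1), (2)")

// Step 2: evolve the schema with a decimal too wide to fit in a long
// (precision > 18) and give it a default value for the existing rows.
spark.sql("ALTER TABLE t ADD COLUMNS (d DECIMAL(38, 18) DEFAULT 123.45)")

// Step 3: reading the old files through the new schema hits the NPE above
// when the default value is appended into the column vector.
spark.sql("SELECT * FROM t").show()
{code}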



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46066) Use the Separators API instead of the String API to construct the DefaultPrettyPrinter

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46066:
--

Assignee: Apache Spark

> Use the Separators API instead of the String API to construct the 
> DefaultPrettyPrinter
> --
>
> Key: SPARK-46066
> URL: https://issues.apache.org/jira/browse/SPARK-46066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
>
> {code:java}
> /**
>  * Constructor that specifies separator String to use between root values;
>  * if null, no separator is printed.
>  *
>  * Note: simply constructs a {@link SerializedString} out of parameter,
>  * calls {@link #DefaultPrettyPrinter(SerializableString)}
>  *
>  * @param rootSeparator String to use as root value separator
>  * @deprecated in 2.16. Use the Separators API instead.
>  */
> @Deprecated
> public DefaultPrettyPrinter(String rootSeparator) {
> this((rootSeparator == null) ? null : new 
> SerializedString(rootSeparator));
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46074) [CONNECT][SCALA] Insufficient details in error when a UDF fails

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46074:
--

Assignee: Apache Spark

> [CONNECT][SCALA] Insufficient details in error when a UDF fails
> ---
>
> Key: SPARK-46074
> URL: https://issues.apache.org/jira/browse/SPARK-46074
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Niranjan Jayakar
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Currently, when a UDF fails the connect client does not receive the actual 
> error that caused the failure. 
> As an example, the error message looks like -
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: 
> grpc_shaded.io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to 
> stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost 
> task 2.3 in stage 0.0 (TID 10) (10.68.141.158 executor 0): 
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (` (Main$$$Lambda$4770/1714264622)`: (int) => int). 
> SQLSTATE: 39000 {code}
> In this case, the actual error was a {{{}java.lang.NoClassDefFoundError{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46074) [CONNECT][SCALA] Insufficient details in error when a UDF fails

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46074:
--

Assignee: (was: Apache Spark)

> [CONNECT][SCALA] Insufficient details in error when a UDF fails
> ---
>
> Key: SPARK-46074
> URL: https://issues.apache.org/jira/browse/SPARK-46074
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Niranjan Jayakar
>Priority: Major
>  Labels: pull-request-available
>
> Currently, when a UDF fails the connect client does not receive the actual 
> error that caused the failure. 
> As an example, the error message looks like -
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: 
> grpc_shaded.io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to 
> stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost 
> task 2.3 in stage 0.0 (TID 10) (10.68.141.158 executor 0): 
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (` (Main$$$Lambda$4770/1714264622)`: (int) => int). 
> SQLSTATE: 39000 {code}
> In this case, the actual error was a {{{}java.lang.NoClassDefFoundError{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46066) Use the Separators API instead of the String API to construct the DefaultPrettyPrinter

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46066:
--

Assignee: (was: Apache Spark)

> Use the Separators API instead of the String API to construct the 
> DefaultPrettyPrinter
> --
>
> Key: SPARK-46066
> URL: https://issues.apache.org/jira/browse/SPARK-46066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>
> {code:java}
> /**
>  * Constructor that specifies separator String to use between root values;
>  * if null, no separator is printed.
>  *
>  * Note: simply constructs a {@link SerializedString} out of parameter,
>  * calls {@link #DefaultPrettyPrinter(SerializableString)}
>  *
>  * @param rootSeparator String to use as root value separator
>  * @deprecated in 2.16. Use the Separators API instead.
>  */
> @Deprecated
> public DefaultPrettyPrinter(String rootSeparator) {
> this((rootSeparator == null) ? null : new 
> SerializedString(rootSeparator));
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values

2023-11-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789389#comment-17789389
 ] 

ASF GitHub Bot commented on SPARK-46056:


User 'cosmind-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/43960

> Vectorized parquet reader throws NPE when reading files with DecimalType 
> default values
> ---
>
> Key: SPARK-46056
> URL: https://issues.apache.org/jira/browse/SPARK-46056
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Cosmin Dumitru
>Priority: Major
>  Labels: pull-request-available
>
> The scenario is a bit more complicated than what the title says but it's not 
> that far fetched. 
>  # Write a parquet file with one column
>  # Evolve the schema and add a new column with DecimalType wide enough that 
> it doesn't fit in a long and has a default value. 
>  # Try to read the file with the new schema
>  # NPE 
> The issue lies in how the column vector stores DecimalTypes. It incorrectly 
> assumes that they fit in a long and tries to write them to the associated 
> long array.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L724]
>  
> In OnHeapColumnVector, which extends WritableColumnVector, reserveInternal() 
> checks whether the type is too wide and initializes the array elements. 
> [https://github.com/apache/spark/blob/b568ba43f0dd80130bca1bf86c48d0d359e57883/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java#L568]
> isArray() returns true if the type is byteArrayDecimalType.
> [https://github.com/apache/spark/blob/afebf8e6c9f24d264580d084cb12e3e6af120a5a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L945]
>  
> Without the fix 
> {code:java}
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:370)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendLongs(WritableColumnVector.java:611)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendObjects(WritableColumnVector.java:745)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.(ParquetColumnVector.java:95)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:286)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:306)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:293)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:280)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
> [info]   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> [info]   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
> [info]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891){code}
> fix PR [https://github.com/apache/spark/pull/43960]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46056:
--

Assignee: (was: Apache Spark)

> Vectorized parquet reader throws NPE when reading files with DecimalType 
> default values
> ---
>
> Key: SPARK-46056
> URL: https://issues.apache.org/jira/browse/SPARK-46056
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Cosmin Dumitru
>Priority: Major
>  Labels: pull-request-available
>
> The scenario is a bit more complicated than what the title says but it's not 
> that far fetched. 
>  # Write a parquet file with one column
>  # Evolve the schema and add a new column with DecimalType wide enough that 
> it doesn't fit in a long and has a default value. 
>  # Try to read the file with the new schema
>  # NPE 
> The issue lies in how the column vector stores DecimalTypes. It incorrectly 
> assumes that they fit in a long and tries to write them to the associated 
> long array.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L724]
>  
> In OnHeapColumnVector, which extends WritableColumnVector, reserveInternal() 
> checks whether the type is too wide and initializes the array elements. 
> [https://github.com/apache/spark/blob/b568ba43f0dd80130bca1bf86c48d0d359e57883/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java#L568]
> isArray() returns true if the type is byteArrayDecimalType.
> [https://github.com/apache/spark/blob/afebf8e6c9f24d264580d084cb12e3e6af120a5a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L945]
>  
> Without the fix 
> {code:java}
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:370)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendLongs(WritableColumnVector.java:611)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendObjects(WritableColumnVector.java:745)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.(ParquetColumnVector.java:95)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:286)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:306)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:293)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:280)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
> [info]   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> [info]   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
> [info]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891){code}
> fix PR [https://github.com/apache/spark/pull/43960]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46066) Use the Separators API instead of the String API to construct the DefaultPrettyPrinter

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46066:
--

Assignee: Apache Spark

> Use the Separators API instead of the String API to construct the 
> DefaultPrettyPrinter
> --
>
> Key: SPARK-46066
> URL: https://issues.apache.org/jira/browse/SPARK-46066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
>
> {code:java}
> /**
>  * Constructor that specifies separator String to use between root values;
>  * if null, no separator is printed.
>  *
>  * Note: simply constructs a {@link SerializedString} out of parameter,
>  * calls {@link #DefaultPrettyPrinter(SerializableString)}
>  *
>  * @param rootSeparator String to use as root value separator
>  * @deprecated in 2.16. Use the Separators API instead.
>  */
> @Deprecated
> public DefaultPrettyPrinter(String rootSeparator) {
> this((rootSeparator == null) ? null : new 
> SerializedString(rootSeparator));
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46066) Use the Separators API instead of the String API to construct the DefaultPrettyPrinter

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46066:
--

Assignee: (was: Apache Spark)

> Use the Separators API instead of the String API to construct the 
> DefaultPrettyPrinter
> --
>
> Key: SPARK-46066
> URL: https://issues.apache.org/jira/browse/SPARK-46066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>
> {code:java}
> /**
>  * Constructor that specifies separator String to use between root values;
>  * if null, no separator is printed.
>  *
>  * Note: simply constructs a {@link SerializedString} out of parameter,
>  * calls {@link #DefaultPrettyPrinter(SerializableString)}
>  *
>  * @param rootSeparator String to use as root value separator
>  * @deprecated in 2.16. Use the Separators API instead.
>  */
> @Deprecated
> public DefaultPrettyPrinter(String rootSeparator) {
> this((rootSeparator == null) ? null : new 
> SerializedString(rootSeparator));
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values

2023-11-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789387#comment-17789387
 ] 

ASF GitHub Bot commented on SPARK-46056:


User 'cosmind-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/43960

> Vectorized parquet reader throws NPE when reading files with DecimalType 
> default values
> ---
>
> Key: SPARK-46056
> URL: https://issues.apache.org/jira/browse/SPARK-46056
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Cosmin Dumitru
>Priority: Major
>  Labels: pull-request-available
>
> The scenario is a bit more complicated than what the title says but it's not 
> that far fetched. 
>  # Write a parquet file with one column
>  # Evolve the schema and add a new column with DecimalType wide enough that 
> it doesn't fit in a long and has a default value. 
>  # Try to read the file with the new schema
>  # NPE 
> The issue lies in how the column vector stores DecimalTypes. It incorrectly 
> assumes that they fit in a long and tries to write them to the associated 
> long array.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L724]
>  
> In OnHeapColumnVector, which extends WritableColumnVector, reserveInternal() 
> checks whether the type is too wide and initializes the array elements. 
> [https://github.com/apache/spark/blob/b568ba43f0dd80130bca1bf86c48d0d359e57883/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java#L568]
> isArray() returns true if the type is byteArrayDecimalType.
> [https://github.com/apache/spark/blob/afebf8e6c9f24d264580d084cb12e3e6af120a5a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L945]
>  
> Without the fix 
> {code:java}
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:370)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendLongs(WritableColumnVector.java:611)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendObjects(WritableColumnVector.java:745)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.(ParquetColumnVector.java:95)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:286)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:306)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:293)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:280)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
> [info]   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> [info]   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
> [info]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891){code}
> fix PR [https://github.com/apache/spark/pull/43960]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46056:
--

Assignee: (was: Apache Spark)

> Vectorized parquet reader throws NPE when reading files with DecimalType 
> default values
> ---
>
> Key: SPARK-46056
> URL: https://issues.apache.org/jira/browse/SPARK-46056
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Cosmin Dumitru
>Priority: Major
>  Labels: pull-request-available
>
> The scenario is a bit more complicated than what the title says but it's not 
> that far fetched. 
>  # Write a parquet file with one column
>  # Evolve the schema and add a new column with DecimalType wide enough that 
> it doesn't fit in a long and has a default value. 
>  # Try to read the file with the new schema
>  # NPE 
> The issue lies in how the column vector stores DecimalTypes. It incorrectly 
> assumes that they fit in a long and tries to write them to the associated 
> long array.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L724]
>  
> In OnHeapColumnVector, which extends WritableColumnVector, reserveInternal() 
> checks whether the type is too wide and initializes the array elements. 
> [https://github.com/apache/spark/blob/b568ba43f0dd80130bca1bf86c48d0d359e57883/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java#L568]
> isArray() returns true if the type is byteArrayDecimalType.
> [https://github.com/apache/spark/blob/afebf8e6c9f24d264580d084cb12e3e6af120a5a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L945]
>  
> Without the fix 
> {code:java}
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:370)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendLongs(WritableColumnVector.java:611)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendObjects(WritableColumnVector.java:745)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.(ParquetColumnVector.java:95)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:286)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:306)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:293)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:280)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
> [info]   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> [info]   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
> [info]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891){code}
> fix PR [https://github.com/apache/spark/pull/43960]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46056:
--

Assignee: Apache Spark

> Vectorized parquet reader throws NPE when reading files with DecimalType 
> default values
> ---
>
> Key: SPARK-46056
> URL: https://issues.apache.org/jira/browse/SPARK-46056
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Cosmin Dumitru
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> The scenario is a bit more complicated than what the title says but it's not 
> that far fetched. 
>  # Write a parquet file with one column
>  # Evolve the schema and add a new column with DecimalType wide enough that 
> it doesn't fit in a long and has a default value. 
>  # Try to read the file with the new schema
>  # NPE 
> The issue lies in how the column vector stores DecimalTypes. It incorrectly 
> assumes that they fit in a long and tries to write them to the associated 
> long array.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L724]
>  
> In OnHeapColumnVector, which extends WritableColumnVector, reserveInternal() 
> checks whether the type is too wide and initializes the array elements. 
> [https://github.com/apache/spark/blob/b568ba43f0dd80130bca1bf86c48d0d359e57883/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java#L568]
> isArray() returns true if the type is byteArrayDecimalType.
> [https://github.com/apache/spark/blob/afebf8e6c9f24d264580d084cb12e3e6af120a5a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L945]
>  
> Without the fix 
> {code:java}
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:370)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendLongs(WritableColumnVector.java:611)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendObjects(WritableColumnVector.java:745)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.(ParquetColumnVector.java:95)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:286)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:306)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:293)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:280)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
> [info]   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> [info]   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
> [info]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891){code}
> fix PR [https://github.com/apache/spark/pull/43960]
>  
>  
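For reference, a minimal reproduction sketch of the steps described above, assuming a spark-shell session with the default `spark` SparkSession and the vectorized parquet reader enabled. The table name, precision, and default value are illustrative, not taken from the original report, and whether the NPE actually triggers depends on the affected versions listed above.

{code:scala}
// Sketch only: table name and default value are hypothetical.
// 1. Write a parquet file with a single column.
spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1)")

// 2. Evolve the schema: add a DecimalType column too wide to fit in a long
//    (precision > 18) together with a default value.
spark.sql("ALTER TABLE t ADD COLUMNS (d DECIMAL(25, 5) DEFAULT 12345.67890)")

// 3. Read the old file with the new schema through the vectorized reader;
//    per the report, this is where OnHeapColumnVector.putLongs hits the NPE.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
spark.sql("SELECT * FROM t").show()
{code}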



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45943) DataSourceV2Relation.computeStats throws IllegalStateException in test mode

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45943:
--

Assignee: (was: Apache Spark)

> DataSourceV2Relation.computeStats throws IllegalStateException in test mode
> ---
>
> Key: SPARK-45943
> URL: https://issues.apache.org/jira/browse/SPARK-45943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available
>
> This issue surfaces when the new unit test of PR 
> [SPARK-45866|https://github.com/apache/spark/pull/43824] is added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46085) Dataset.groupingSets in Scala Spark Connect client

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46085:
--

Assignee: Apache Spark

> Dataset.groupingSets in Scala Spark Connect client
> --
>
> Key: SPARK-46085
> URL: https://issues.apache.org/jira/browse/SPARK-46085
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Scala Spark Connect client for SPARK-45929
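For context, below is a hedged usage sketch of the Dataset.groupingSets API that this ticket ports to the Scala Spark Connect client. The signature shown (a Seq of grouping sets plus the grouping columns) is one reading of SPARK-45929 and should be verified against the merged change; the sample DataFrame is illustrative.

{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes an existing SparkSession named `spark`

val df = Seq(("a", "x", 1), ("a", "y", 2), ("b", "x", 3)).toDF("k1", "k2", "v")

// Roughly the DataFrame equivalent of
//   SELECT k1, k2, sum(v) FROM t GROUP BY GROUPING SETS ((k1, k2), (k1))
df.groupingSets(Seq(Seq($"k1", $"k2"), Seq($"k1")), $"k1", $"k2")
  .agg(sum($"v").as("total"))
  .show()
{code}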



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45943) DataSourceV2Relation.computeStats throws IllegalStateException in test mode

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45943:
--

Assignee: Apache Spark

> DataSourceV2Relation.computeStats throws IllegalStateException in test mode
> ---
>
> Key: SPARK-45943
> URL: https://issues.apache.org/jira/browse/SPARK-45943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> This issue surfaces when the new unit test of PR 
> [SPARK-45866|https://github.com/apache/spark/pull/43824] is added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46085) Dataset.groupingSets in Scala Spark Connect client

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46085:
--

Assignee: (was: Apache Spark)

> Dataset.groupingSets in Scala Spark Connect client
> --
>
> Key: SPARK-46085
> URL: https://issues.apache.org/jira/browse/SPARK-46085
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> Scala Spark Connect client for SPARK-45929



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46085) Dataset.groupingSets in Scala Spark Connect client

2023-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46085:
--

Assignee: Apache Spark

> Dataset.groupingSets in Scala Spark Connect client
> --
>
> Key: SPARK-46085
> URL: https://issues.apache.org/jira/browse/SPARK-46085
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Scala Spark Connect client for SPARK-45929



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values

2023-11-24 Thread Cosmin Dumitru (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cosmin Dumitru updated SPARK-46056:
---
Labels: pull-request-available  (was: )

> Vectorized parquet reader throws NPE when reading files with DecimalType 
> default values
> ---
>
> Key: SPARK-46056
> URL: https://issues.apache.org/jira/browse/SPARK-46056
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Cosmin Dumitru
>Priority: Major
>  Labels: pull-request-available
>
> The scenario is a bit more complicated than what the title says, but it's not 
> that far-fetched. 
>  # Write a parquet file with one column
>  # Evolve the schema and add a new column with DecimalType wide enough that 
> it doesn't fit in a long and has a default value. 
>  # Try to read the file with the new schema
>  # NPE 
> The issue lies in how the column vector stores DecimalTypes. It incorrectly 
> assumes that they fit in a long and tries to write them to the associated long array.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L724]
>  
> In OnHeapColumnVector, which extends WritableColumnVector, reserveInternal() 
> checks whether the type is too wide and initializes the array elements. 
> [https://github.com/apache/spark/blob/b568ba43f0dd80130bca1bf86c48d0d359e57883/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java#L568]
> isArray() returns true if the type is byteArrayDecimalType.
> [https://github.com/apache/spark/blob/afebf8e6c9f24d264580d084cb12e3e6af120a5a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L945]
>  
> Without the fix 
> {code:java}
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:370)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendLongs(WritableColumnVector.java:611)
> [info]   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendObjects(WritableColumnVector.java:745)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:95)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:286)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:306)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:293)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:280)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
> [info]   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> [info]   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
> [info]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891){code}
> fix PR [https://github.com/apache/spark/pull/43960]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45356) Adjust the Maven daily test configuration

2023-11-24 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-45356.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43999
[https://github.com/apache/spark/pull/43999]

> Adjust the Maven daily test configuration
> -
>
> Key: SPARK-45356
> URL: https://issues.apache.org/jira/browse/SPARK-45356
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45356) Adjust the Maven daily test configuration

2023-11-24 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-45356:


Assignee: Yang Jie

> Adjust the Maven daily test configuration
> -
>
> Key: SPARK-45356
> URL: https://issues.apache.org/jira/browse/SPARK-45356
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org