[jira] [Resolved] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge resolved SPARK-45657.

Fix Version/s: 3.5.0
   Resolution: Fixed

The issue is fixed in 3.5.0

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.4.1
>Reporter: John Zhuge
>Priority: Major
> Fix For: 3.5.0
>
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-45657:
---
Affects Version/s: 3.4.1
   3.4.0

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.4.1
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken

2023-10-24 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-45658:
-
Description: 
The canonicalization of the buildKeys (Seq[Expression]) in the class 
DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by 
calling 
buildKeys.map(_.canonicalized)
The above results in incorrect canonicalization because it does not normalize 
the exprIds relative to the buildQuery output.
The fix is to use the output of buildQuery: LogicalPlan to normalize the 
buildKeys expressions, using the standard approach given below:

buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))

Will be filing a PR and a bug test for the same.
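For illustration only, a minimal sketch of what the described normalization could look like inside the expression's canonicalization (the copy() shape and the exprId reset are assumptions for readability, not the actual Spark source):
{code:java}
// Sketch only (assumed shape, not the actual Spark source): normalize the
// buildKeys against buildQuery.output so that two semantically equal
// DynamicPruningSubquery instances canonicalize to the same form.
override lazy val canonicalized: DynamicPruningSubquery = {
  copy(
    buildQuery = buildQuery.canonicalized,
    buildKeys = buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)),
    exprId = ExprId(0))
}
{code}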

  was:
The canonicalization of the buildKeys (Seq[Expression]) in the class 
DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by 
calling 
buildKeys.map(_.canonicalized)
The above results in incorrect canonicalization because it does not normalize 
the exprIds.
The fix is to use the output of buildQuery: LogicalPlan to normalize the 
buildKeys expressions, using the standard approach given below:

buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))

Will be filing a PR and a bug test for the same.


> Canonicalization of DynamicPruningSubquery is broken
> 
>
> Key: SPARK-45658
> URL: https://issues.apache.org/jira/browse/SPARK-45658
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Asif
>Priority: Major
>
> The canonicalization of the buildKeys (Seq[Expression]) in the class 
> DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by 
> calling 
> buildKeys.map(_.canonicalized)
> The above results in incorrect canonicalization because it does not normalize 
> the exprIds relative to the buildQuery output.
> The fix is to use the output of buildQuery: LogicalPlan to normalize the 
> buildKeys expressions, using the standard approach given below:
> buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))
> Will be filing a PR and a bug test for the same.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken

2023-10-24 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-45658:
-
Priority: Major  (was: Critical)

> Canonicalization of DynamicPruningSubquery is broken
> 
>
> Key: SPARK-45658
> URL: https://issues.apache.org/jira/browse/SPARK-45658
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Asif
>Priority: Major
>
> The canonicalization of the buildKeys (Seq[Expression]) in the class 
> DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by 
> calling 
> buildKeys.map(_.canonicalized)
> The above results in incorrect canonicalization because it does not normalize 
> the exprIds.
> The fix is to use the output of buildQuery: LogicalPlan to normalize the 
> buildKeys expressions, using the standard approach given below:
> buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))
> Will be filing a PR and a bug test for the same.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken

2023-10-24 Thread Asif (Jira)
Asif created SPARK-45658:


 Summary: Canonicalization of DynamicPruningSubquery is broken
 Key: SPARK-45658
 URL: https://issues.apache.org/jira/browse/SPARK-45658
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.5.1
Reporter: Asif


The canonicalization of the buildKeys (Seq[Expression]) in the class 
DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by 
calling 
buildKeys.map(_.canonicalized)
The above results in incorrect canonicalization because it does not normalize 
the exprIds.
The fix is to use the output of buildQuery: LogicalPlan to normalize the 
buildKeys expressions, using the standard approach given below:

buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))

Will be filing a PR and a bug test for the same.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779281#comment-17779281
 ] 

John Zhuge edited comment on SPARK-45657 at 10/25/23 4:55 AM:
--

Root cause:
 # SQL UNION of 2 sides with different data types produces a Project of Project 
on 1 side to cast the type. When this is cached, the Project of Project is 
preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions` which applies to all unions in the 
tree. CombineUnions collapses the 2 Projects into 1, thus Dataset.union of the 
above plan with any plan will not be able to find a matching cached plan.
{code:java}
object CombineUnions extends Rule[LogicalPlan] {
...
  private def flattenUnion(union: Union, flattenDistinct: Boolean):
...
    case p1 @ Project(_, p2: Project)
  if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = 
false) &&
!p1.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) &&
!p2.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) =>
  val newProjectList = buildCleanedProjectList(p1.projectList, 
p2.projectList)
  stack.pushAll(Seq(p2.copy(projectList = newProjectList))){code}
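Based on the observation in the description that the pure-SQL form does hit the cache, one possible workaround (a sketch only; the view name is illustrative) is to express the outer union in SQL so the eager CombineUnions rewrite in Dataset.union never runs before the cache lookup:
{code:java}
// Sketch: keep the cached subtree intact by doing the outer union in SQL
// instead of Dataset.union. View name is illustrative.
val cached = spark.sql("select 1 id union select 's2' id").cache()
cached.createOrReplaceTempView("cached_union")

spark.sql("select * from cached_union union select 's3'")
  .queryExecution.optimizedPlan
// Per the plans shown in the description, the optimized plan is expected to
// contain an InMemoryRelation for the cached side.
{code}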

 


was (Author: jzhuge):
Root cause:
 # SQL UNION of 2 sides with different data types produces a Project of Project 
on 1 side to cast the type. When this is cached, the Project of Project is 
preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions` which applies to all unions in the 
tree. CombineUnions collapses the 2 Projects into 1, thus Dataset.union of the 
above plan with any plan will not be able to find a matching cached plan.
{code:java}
object CombineUnions extends Rule[LogicalPlan] {
...
  private def flattenUnion(union: Union, flattenDistinct: Boolean):
...
    case p1 @ Project(_, p2: Project)
  if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = 
false) &&
!p1.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) &&
!p2.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) =>
  val newProjectList = buildCleanedProjectList(p1.projectList, 
p2.projectList)
  stack.pushAll(Seq(p2.copy(projectList = newProjectList))){code}

 

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779326#comment-17779326
 ] 

John Zhuge commented on SPARK-45657:


It is fixed in the main branch:
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
      /_/

Using Scala version 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.7)
Type in expressions to have them evaluated.
Type :help for more information.
23/10/24 21:30:31 WARN NativeCodeLoader: Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.86.29:4040
Spark context available as 'sc' (master = local[*], app id = local-1698208231783).
Spark session available as 'spark'.

scala> spark.sql("select 1 id union select 's2' id").cache()
val res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string]

scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")).queryExecution.optimizedPlan
val res1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Union false, false
:- InMemoryRelation [id#11], StorageLevel(disk, memory, deserialized, 1 
replicas)
:     +- AdaptiveSparkPlan isFinalPlan=false
:        +- HashAggregate(keys=[id#2], functions=[], output=[id#2])
:           +- Exchange hashpartitioning(id#2, 200), ENSURE_REQUIREMENTS, 
[plan_id=30]
:              +- HashAggregate(keys=[id#2], functions=[], output=[id#2])
:                 +- Union
:                    :- Project [1 AS id#2]
:                    :  +- Scan OneRowRelation[]
:                    +- Project [s2 AS id#1]
:                       +- Scan OneRowRelation[]
+- Project [s3 AS s3#13]
   +- OneRowRelation {code}

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779283#comment-17779283
 ] 

John Zhuge commented on SPARK-45657:


Interesting, there is a warning in Dataset.union:
{code:java}
def union(other: Dataset[T]): Dataset[T] = withSetOperator {
  // This breaks caching, but it's usually ok because it addresses a very 
specific use case:
  // using union to union many files or partitions.
  CombineUnions(Union(logicalPlan, other.logicalPlan))
} {code}
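To see that eager rewrite, a quick spark-shell sketch (reusing the example from the description, not additional evidence) is to compare the plan produced by Dataset.union with the plan that was cached:
{code:java}
// Sketch: Dataset.union applies CombineUnions eagerly, so the nested
// Project-over-Project introduced by the type coercion is already collapsed
// in the resulting logical plan and no longer matches the cached plan.
val cached = spark.sql("select 1 id union select 's2' id").cache()
cached.union(spark.sql("select 's3'")).queryExecution.logical  // collapsed Projects
cached.queryExecution.analyzed                                 // nested Projects (what was cached)
{code}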

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-45657:
---
Affects Version/s: 3.3.2
   (was: 3.4.1)

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779282#comment-17779282
 ] 

John Zhuge commented on SPARK-45657:


Checking whether this is still an issue in the main branch.

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779281#comment-17779281
 ] 

John Zhuge edited comment on SPARK-45657 at 10/25/23 12:38 AM:
---

Root cause:
 # SQL UNION of 2 sides with different data types produces a Project of Project 
on 1 side to cast the type. When this is cached, the Project of Project is 
preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions` which applies to all unions in the 
tree. CombineUnions collapses the 2 Projects into 1, thus Dataset.union of the 
above plan with any plan will not be able to find a matching cached plan.
{code:java}
object CombineUnions extends Rule[LogicalPlan] {
...
  private def flattenUnion(union: Union, flattenDistinct: Boolean):
...
    case p1 @ Project(_, p2: Project)
  if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = 
false) &&
!p1.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) &&
!p2.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) =>
  val newProjectList = buildCleanedProjectList(p1.projectList, 
p2.projectList)
  stack.pushAll(Seq(p2.copy(projectList = newProjectList))){code}

 


was (Author: jzhuge):
Root cause:
 # SQL UNION of 2 sides with different data types produces a Project of Project 
on 1 side to cast the type. When this is cached, the Project of Project is 
preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions` which applies to all unions in the 
tree. CombineUnions collapses the 2 Projects into 1, thus Dataset.union of the 
above plan with any plan will not be able to find a matching cached plan.

 

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779281#comment-17779281
 ] 

John Zhuge edited comment on SPARK-45657 at 10/25/23 12:36 AM:
---

Root cause:
 # SQL UNION of 2 sides with different data types produces a Project of Project 
on 1 side to cast the type. When this is cached, the Project of Project is 
preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions` which applies to all unions in the 
tree. CombineUnions collapses the 2 Projects into 1, thus Dataset.union of the 
above plan with any plan will not be able to find a matching cached plan.

 


was (Author: jzhuge):
Root cause:
 # SQL UNION of 2 sides with different data types produces a Project of Project 
on 1 side to cast the type. When this is cached, the Project of Project is 
preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions` which applies to all unions in the 
tree. CombineUnions collapses the 2 Projects into 1. Thus Dataset.union of the 
above plan with any plan will not be able to find a matching cached plan.

 

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779281#comment-17779281
 ] 

John Zhuge commented on SPARK-45657:


Root cause:
 # SQL UNION of 2 sides with different data types produces a Project of Project 
on 1 side to cast the type. When this is cached, the Project of Project is 
preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions` which applies to all unions in the 
tree. CombineUnions collapses the 2 Projects into 1. Thus Dataset.union of the 
above plan with any plan will not be able to find a matching cached plan.

 

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45656) Fix observation when named observations with the same name on different datasets.

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45656:
---
Labels: pull-request-available  (was: )

> Fix observation when named observations with the same name on different 
> datasets.
> -
>
> Key: SPARK-45656
> URL: https://issues.apache.org/jira/browse/SPARK-45656
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)
John Zhuge created SPARK-45657:
--

 Summary: Caching SQL UNION of different column data types does not 
work inside Dataset.union
 Key: SPARK-45657
 URL: https://issues.apache.org/jira/browse/SPARK-45657
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1
Reporter: John Zhuge


 

Cache SQL UNION of 2 sides with different column data types
{code:java}
scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
Dataset.union does not leverage the cache
{code:java}
scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
's3'")).queryExecution.optimizedPlan
res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Union false, false
:- Aggregate [id#109], [id#109]
:  +- Union false, false
:     :- Project [1 AS id#109]
:     :  +- OneRowRelation
:     +- Project [s2 AS id#108]
:        +- OneRowRelation
+- Project [s3 AS s3#111]
   +- OneRowRelation {code}
SQL UNION of the cached SQL UNION does use the cache! Please note 
`InMemoryRelation` used.
{code:java}
scala> spark.sql("(select 1 id union select 's2' id) union select 
's3'").queryExecution.optimizedPlan
res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Aggregate [id#117], [id#117]
+- Union false, false
   :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
replicas)
   :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
   :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
[plan_id=241]
   :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
output=[id#100])
   :              +- Union
   :                 :- *(1) Project [1 AS id#100]
   :                 :  +- *(1) Scan OneRowRelation[]
   :                 +- *(2) Project [s2 AS id#99]
   :                    +- *(2) Scan OneRowRelation[]
   +- Project [s3 AS s3#116]
      +- OneRowRelation {code}
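For anyone reproducing this, a small helper along these lines (a sketch, not part of the report) makes it easy to check whether an optimized plan actually picked up the cache:
{code:java}
// Sketch: true if the optimized plan of `df` contains an InMemoryRelation,
// i.e. the cached data would actually be reused.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.columnar.InMemoryRelation

def usesCache(df: DataFrame): Boolean =
  df.queryExecution.optimizedPlan.collectFirst { case r: InMemoryRelation => r }.nonEmpty

// Expected on affected versions, per the plans above:
// usesCache(spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")))  // false
// usesCache(spark.sql("(select 1 id union select 's2' id) union select 's3'"))              // true
{code}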
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45656) Fix observation when named observations with the same name on different datasets.

2023-10-24 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-45656:
-

 Summary: Fix observation when named observations with the same 
name on different datasets.
 Key: SPARK-45656
 URL: https://issues.apache.org/jira/browse/SPARK-45656
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 4.0.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45648) Add sql/api and common/utils to modules.py

2023-10-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45648:


Assignee: Ruifeng Zheng

> Add sql/api and common/utils to modules.py
> --
>
> Key: SPARK-45648
> URL: https://issues.apache.org/jira/browse/SPARK-45648
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45648) Add sql/api and common/utils to modules.py

2023-10-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45648.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43501
[https://github.com/apache/spark/pull/43501]

> Add sql/api and common/utils to modules.py
> --
>
> Key: SPARK-45648
> URL: https://issues.apache.org/jira/browse/SPARK-45648
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45648) Add sql/api and common/utils to modules.py

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45648:
---
Labels: pull-request-available  (was: )

> Add sql/api and common/utils to modules.py
> --
>
> Key: SPARK-45648
> URL: https://issues.apache.org/jira/browse/SPARK-45648
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45622) java -target should use java.version instead of 17

2023-10-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45622.
--
Resolution: Invalid

> java -target should use java.version instead of 17
> --
>
> Key: SPARK-45622
> URL: https://issues.apache.org/jira/browse/SPARK-45622
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45651) Snapshots of some packages are not published any more

2023-10-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45651:


Assignee: Enrico Minack

> Snapshots of some packages are not published any more
> -
>
> Key: SPARK-45651
> URL: https://issues.apache.org/jira/browse/SPARK-45651
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
>  Labels: pull-request-available
>
> Snapshots of some packages have not been published any more, e.g. 
> spark-sql_2.13-4.0.0 has not been published since Sep 13th: 
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.13/4.0.0-SNAPSHOT/
> There have been some attempts to fix CI: SPARK-45535 SPARK-45536
> The assumption is that memory consumption during the build exceeds the available 
> memory of the GitHub host.
> The following could be attempted:
> - enable manual trigger of the {{publish_snapshots.yml}} workflow
> - enable some memory-use logging to prove that exceeded memory is the root 
> cause
> - attempt to reduce the memory footprint and see the impact in the above logging
> - revert the memory-use logging



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45651) Snapshots of some packages are not published any more

2023-10-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45651.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43512
[https://github.com/apache/spark/pull/43512]

> Snapshots of some packages are not published any more
> -
>
> Key: SPARK-45651
> URL: https://issues.apache.org/jira/browse/SPARK-45651
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Snapshots of some packages have not been published any more, e.g. 
> spark-sql_2.13-4.0.0 has not been published since Sep 13th: 
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.13/4.0.0-SNAPSHOT/
> There have been some attempts to fix CI: SPARK-45535 SPARK-45536
> The assumption is that memory consumption during the build exceeds the available 
> memory of the GitHub host.
> The following could be attempted:
> - enable manual trigger of the {{publish_snapshots.yml}} workflow
> - enable some memory-use logging to prove that exceeded memory is the root 
> cause
> - attempt to reduce the memory footprint and see the impact in the above logging
> - revert the memory-use logging



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45640) Fix flaky ProtobufCatalystDataConversionSuite

2023-10-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45640:


Assignee: BingKun Pan

> Fix flaky ProtobufCatalystDataConversionSuite
> -
>
> Key: SPARK-45640
> URL: https://issues.apache.org/jira/browse/SPARK-45640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45640) Fix flaky ProtobufCatalystDataConversionSuite

2023-10-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45640.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43493
[https://github.com/apache/spark/pull/43493]

> Fix flaky ProtobufCatalystDataConversionSuite
> -
>
> Key: SPARK-45640
> URL: https://issues.apache.org/jira/browse/SPARK-45640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45655) current_date() not supported in Streaming Query Observed metrics

2023-10-24 Thread Bhuwan Sahni (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779264#comment-17779264
 ] 

Bhuwan Sahni commented on SPARK-45655:
--

PR link https://github.com/apache/spark/pull/43517

> current_date() not supported in Streaming Query Observed metrics
> 
>
> Key: SPARK-45655
> URL: https://issues.apache.org/jira/browse/SPARK-45655
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Streaming queries do not support current_date() inside CollectMetrics. The 
> primary reason is that current_date() (which resolves to CurrentBatchTimestamp) 
> is marked as non-deterministic. However, {{current_date}} and 
> {{current_timestamp}} are both deterministic today, and 
> {{current_batch_timestamp}} should be the same.
>  
> As an example, the query below fails due to the observe call on the DataFrame.
>  
> {quote}val inputData = MemoryStream[Timestamp]
> inputData.toDF()
>       .filter("value < current_date()")
>       .observe("metrics", count(expr("value >= 
> current_date()")).alias("dropped"))
>       .writeStream
>       .queryName("ts_metrics_test")
>       .format("memory")
>       .outputMode("append")
>       .start()
> {quote}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45655) current_date() not supported in Streaming Query Observed metrics

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45655:
---
Labels: pull-request-available  (was: )

> current_date() not supported in Streaming Query Observed metrics
> 
>
> Key: SPARK-45655
> URL: https://issues.apache.org/jira/browse/SPARK-45655
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Streaming queries do not support current_date() inside CollectMetrics. The 
> primary reason is that current_date() (which resolves to CurrentBatchTimestamp) 
> is marked as non-deterministic. However, {{current_date}} and 
> {{current_timestamp}} are both deterministic today, and 
> {{current_batch_timestamp}} should be the same.
>  
> As an example, the query below fails due to the observe call on the DataFrame.
>  
> {quote}val inputData = MemoryStream[Timestamp]
> inputData.toDF()
>       .filter("value < current_date()")
>       .observe("metrics", count(expr("value >= 
> current_date()")).alias("dropped"))
>       .writeStream
>       .queryName("ts_metrics_test")
>       .format("memory")
>       .outputMode("append")
>       .start()
> {quote}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45654) Add Python data source write API

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45654:
---
Labels: pull-request-available  (was: )

> Add Python data source write API
> 
>
> Key: SPARK-45654
> URL: https://issues.apache.org/jira/browse/SPARK-45654
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Add Python data source write API in datasource.py 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45655) current_date() not supported in Streaming Query Observed metrics

2023-10-24 Thread Bhuwan Sahni (Jira)
Bhuwan Sahni created SPARK-45655:


 Summary: current_date() not supported in Streaming Query Observed 
metrics
 Key: SPARK-45655
 URL: https://issues.apache.org/jira/browse/SPARK-45655
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.5.0, 3.4.1
Reporter: Bhuwan Sahni


Streaming queries do not support current_date() inside CollectMetrics. The 
primary reason is that current_date() (which resolves to CurrentBatchTimestamp) 
is marked as non-deterministic. However, {{current_date}} and 
{{current_timestamp}} are both deterministic today, and 
{{current_batch_timestamp}} should be the same.

 

As an example, the query below fails due to the observe call on the DataFrame.

 
{quote}val inputData = MemoryStream[Timestamp]

inputData.toDF()
      .filter("value < current_date()")
      .observe("metrics", count(expr("value >= 
current_date()")).alias("dropped"))
      .writeStream
      .queryName("ts_metrics_test")
      .format("memory")
      .outputMode("append")
      .start()
{quote}
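A possible interim workaround (an assumption, not something from this report): pin the date as a literal computed once at query start instead of calling current_date() inside observe(). Note the semantics differ for a query that keeps running past midnight.
{code:java}
// Sketch only: same query shape as above, with current_date() replaced by a
// date literal fixed on the driver when the query is defined.
import org.apache.spark.sql.functions.{count, expr}

val startDate = java.time.LocalDate.now()   // pinned once
inputData.toDF()
  .filter(s"value < date'$startDate'")
  .observe("metrics", count(expr(s"value >= date'$startDate'")).alias("dropped"))
  .writeStream
  .queryName("ts_metrics_test")
  .format("memory")
  .outputMode("append")
  .start()
{code}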
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45655) current_date() not supported in Streaming Query Observed metrics

2023-10-24 Thread Bhuwan Sahni (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779239#comment-17779239
 ] 

Bhuwan Sahni commented on SPARK-45655:
--

I am working on a fix for this issue, and will submit a PR soon.

> current_date() not supported in Streaming Query Observed metrics
> 
>
> Key: SPARK-45655
> URL: https://issues.apache.org/jira/browse/SPARK-45655
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bhuwan Sahni
>Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Streaming queries do not support current_date() inside CollectMetrics. The 
> primary reason is that current_date() (which resolves to CurrentBatchTimestamp) 
> is marked as non-deterministic. However, {{current_date}} and 
> {{current_timestamp}} are both deterministic today, and 
> {{current_batch_timestamp}} should be the same.
>  
> As an example, the query below fails due to the observe call on the DataFrame.
>  
> {quote}val inputData = MemoryStream[Timestamp]
> inputData.toDF()
>       .filter("value < current_date()")
>       .observe("metrics", count(expr("value >= 
> current_date()")).alias("dropped"))
>       .writeStream
>       .queryName("ts_metrics_test")
>       .format("memory")
>       .outputMode("append")
>       .start()
> {quote}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45654) Add Python data source write API

2023-10-24 Thread Allison Wang (Jira)
Allison Wang created SPARK-45654:


 Summary: Add Python data source write API
 Key: SPARK-45654
 URL: https://issues.apache.org/jira/browse/SPARK-45654
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add Python data source write API in datasource.py 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45503) RocksDB State Store to Use LZ4 Compression

2023-10-24 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-45503.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43338
[https://github.com/apache/spark/pull/43338]

> RocksDB State Store to Use LZ4 Compression
> --
>
> Key: SPARK-45503
> URL: https://issues.apache.org/jira/browse/SPARK-45503
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> LZ4 is generally faster than Snappy. That's probably why we use LZ4 in changelogs 
> and other places by default. However, we don't change RocksDB's default of the 
> Snappy compression style. The RocksDB team recommends LZ4 or ZSTD; the 
> default is kept as Snappy only for backward-compatibility reasons. We should use 
> LZ4 instead.
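For reference, a minimal sketch of what selecting LZ4 looks like at the RocksDB JNI level. This is standalone usage of the org.rocksdb bindings, not the Spark state store code; the actual change would go through the state store provider's option wiring, which is out of scope here.

{code:scala}
// Standalone RocksDB sketch: compression is an Options setting, LZ4 instead of
// the Snappy default. The database path is a placeholder.
import org.rocksdb.{CompressionType, Options, RocksDB}

object Lz4RocksDbSketch {
  def main(args: Array[String]): Unit = {
    RocksDB.loadLibrary()
    val options = new Options()
      .setCreateIfMissing(true)
      .setCompressionType(CompressionType.LZ4_COMPRESSION)  // default would be SNAPPY_COMPRESSION
    val db = RocksDB.open(options, "/tmp/lz4-rocksdb-sketch")
    try {
      db.put("k".getBytes, "v".getBytes)
      println(new String(db.get("k".getBytes)))
    } finally {
      db.close()
      options.close()
    }
  }
}
{code}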



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45653) Refractor XMLSuite to allow other test suites to easily extend and override.

2023-10-24 Thread Shujing Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shujing Yang resolved SPARK-45653.
--
Resolution: Not A Problem

> Refractor XMLSuite to allow other test suites to easily extend and override.
> 
>
> Key: SPARK-45653
> URL: https://issues.apache.org/jira/browse/SPARK-45653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Shujing Yang
>Priority: Major
>  Labels: pull-request-available
>
> Refactor XmlSuite to integrate dataframe readers, allowing other test suites 
> to easily extend and override.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45653) Refractor XMLSuite to allow other test suites to easily extend and override.

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45653:
---
Labels: pull-request-available  (was: )

> Refractor XMLSuite to allow other test suites to easily extend and override.
> 
>
> Key: SPARK-45653
> URL: https://issues.apache.org/jira/browse/SPARK-45653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Shujing Yang
>Priority: Major
>  Labels: pull-request-available
>
> Refactor XmlSuite to integrate dataframe readers, allowing other test suites 
> to easily extend and override.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45524) Initial support for Python data source read API

2023-10-24 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-45524.
---
Fix Version/s: 4.0.0
 Assignee: Allison Wang
   Resolution: Fixed

Issue resolved by pull request 43360
https://github.com/apache/spark/pull/43360

> Initial support for Python data source read API
> ---
>
> Key: SPARK-45524
> URL: https://issues.apache.org/jira/browse/SPARK-45524
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add API for data source and data source reader and add Catalyst + execution 
> support.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45653) Refractor XMLSuite to allow other test suites to easily extend and override.

2023-10-24 Thread Shujing Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shujing Yang updated SPARK-45653:
-
Summary: Refractor XMLSuite to allow other test suites to easily extend and 
override.  (was: Refractor XMLSuite)

> Refractor XMLSuite to allow other test suites to easily extend and override.
> 
>
> Key: SPARK-45653
> URL: https://issues.apache.org/jira/browse/SPARK-45653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Shujing Yang
>Priority: Major
>
> Refactor XmlSuite to integrate dataframe readers, allowing other test suites 
> to easily extend and override.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45653) Refractor XMLSuite

2023-10-24 Thread Shujing Yang (Jira)
Shujing Yang created SPARK-45653:


 Summary: Refractor XMLSuite
 Key: SPARK-45653
 URL: https://issues.apache.org/jira/browse/SPARK-45653
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Shujing Yang


Refactor XmlSuite to integrate dataframe readers, allowing other test suites to 
easily extend and override.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45643) Replace `s.c.mutable.MapOps#transform` with `s.c.mutable.MapOps#mapValuesInPlace`

2023-10-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45643:
-

Assignee: Yang Jie

> Replace `s.c.mutable.MapOps#transform` with 
> `s.c.mutable.MapOps#mapValuesInPlace`
> -
>
> Key: SPARK-45643
> URL: https://issues.apache.org/jira/browse/SPARK-45643
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> @deprecated("Use mapValuesInPlace instead", "2.13.0")
> @inline final def transform(f: (K, V) => V): this.type = mapValuesInPlace(f) 
> {code}
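For illustration, a small sketch of the migration this subtask calls for. The map contents are made up; only the transform-to-mapValuesInPlace swap is the point.

{code:scala}
import scala.collection.mutable

object MapValuesInPlaceSketch {
  def main(args: Array[String]): Unit = {
    val counts = mutable.Map("a" -> 1, "b" -> 2)

    // Before (deprecated since Scala 2.13.0, forwards to mapValuesInPlace):
    // counts.transform((_, v) => v * 10)

    // After: identical in-place semantics, no deprecation warning.
    counts.mapValuesInPlace((_, v) => v * 10)

    println(counts)  // HashMap(a -> 10, b -> 20)
  }
}
{code}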



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45643) Replace `s.c.mutable.MapOps#transform` with `s.c.mutable.MapOps#mapValuesInPlace`

2023-10-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45643.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43500
[https://github.com/apache/spark/pull/43500]

> Replace `s.c.mutable.MapOps#transform` with 
> `s.c.mutable.MapOps#mapValuesInPlace`
> -
>
> Key: SPARK-45643
> URL: https://issues.apache.org/jira/browse/SPARK-45643
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> @deprecated("Use mapValuesInPlace instead", "2.13.0")
> @inline final def transform(f: (K, V) => V): this.type = mapValuesInPlace(f) 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45651) Snapshots of some packages are not published any more

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45651:
---
Labels: pull-request-available  (was: )

> Snapshots of some packages are not published any more
> -
>
> Key: SPARK-45651
> URL: https://issues.apache.org/jira/browse/SPARK-45651
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Enrico Minack
>Priority: Major
>  Labels: pull-request-available
>
> Snapshots of some packages have not been published anymore, e.g. 
> spark-sql_2.13-4.0.0 has not been published since Sep 13th: 
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.13/4.0.0-SNAPSHOT/
> There have been some attempts to fix CI: SPARK-45535 SPARK-45536
> The assumption is that memory consumption during the build exceeds the 
> available memory of the GitHub host.
> The following could be attempted:
> - enable manual triggering of the {{publish_snapshots.yml}} workflow
> - enable some memory-use logging to prove that exceeded memory is the root 
> cause
> - attempt to reduce the memory footprint and see its impact in the above logging
> - revert the memory-use logging



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45652) SPJ: Handle empty input partitions after dynamic filtering

2023-10-24 Thread Chao Sun (Jira)
Chao Sun created SPARK-45652:


 Summary: SPJ: Handle empty input partitions after dynamic filtering
 Key: SPARK-45652
 URL: https://issues.apache.org/jira/browse/SPARK-45652
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.1
Reporter: Chao Sun


When the number of input partitions becomes 0 after dynamic filtering in 
{{BatchScanExec}}, SPJ currently fails with the error:
{code}
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135)
at 
org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD$lzycompute(BosonBatchScanExec.scala:28)
at 
org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD(BosonBatchScanExec.scala:28)
at 
org.apache.spark.sql.boson.BosonBatchScanExec.doExecuteColumnar(BosonBatchScanExec.scala:33)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
at 
org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:218)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:521)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
{code}

This is because {{groupPartitions}} will return {{None}} for this case.
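To illustrate the failure mode, a simplified sketch follows; the names are assumptions, not the actual BatchScanExec internals, but the Option handling mirrors the description above.

{code:scala}
// Simplified sketch of the reported behaviour: an Option that is only populated
// for the non-empty case makes a later .get throw None.get once dynamic
// filtering prunes every input partition.
object EmptyPartitionsSketch {
  case class InputPartition(id: Int)

  def groupPartitions(parts: Seq[InputPartition]): Option[Seq[Seq[InputPartition]]] =
    if (parts.isEmpty) None else Some(parts.grouped(2).toSeq)

  def main(args: Array[String]): Unit = {
    val filtered: Seq[InputPartition] = Seq.empty  // all partitions pruned at runtime
    // groupPartitions(filtered).get               // java.util.NoSuchElementException: None.get
    val groups = groupPartitions(filtered).getOrElse(Seq.empty)  // safe fallback: empty scan
    println(groups.size)  // 0
  }
}
{code}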



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44405) Reduce code duplication in group-based DELETE and MERGE tests

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44405:
---
Labels: pull-request-available  (was: )

> Reduce code duplication in group-based DELETE and MERGE tests
> -
>
> Key: SPARK-44405
> URL: https://issues.apache.org/jira/browse/SPARK-44405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Priority: Major
>  Labels: pull-request-available
>
> See [this|https://github.com/apache/spark/pull/41600#discussion_r1230014119] 
> discussion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45651) Snapshots of some packages are not published any more

2023-10-24 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-45651:
-

 Summary: Snapshots of some packages are not published any more
 Key: SPARK-45651
 URL: https://issues.apache.org/jira/browse/SPARK-45651
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 4.0.0
Reporter: Enrico Minack


Snapshots of some packages have not been published anymore, e.g. 
spark-sql_2.13-4.0.0 has not been published since Sep 13th: 
https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.13/4.0.0-SNAPSHOT/

There have been some attempts to fix CI: SPARK-45535 SPARK-45536

The assumption is that memory consumption during the build exceeds the available 
memory of the GitHub host.

The following could be attempted:
- enable manual triggering of the {{publish_snapshots.yml}} workflow
- enable some memory-use logging to prove that exceeded memory is the root cause
- attempt to reduce the memory footprint and see its impact in the above logging
- revert the memory-use logging



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45646) Remove hardcoding time variables prior to Hive 2.0

2023-10-24 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-45646:


Assignee: Cheng Pan

> Remove hardcoding time variables prior to Hive 2.0
> --
>
> Key: SPARK-45646
> URL: https://issues.apache.org/jira/browse/SPARK-45646
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45646) Remove hardcoding time variables prior to Hive 2.0

2023-10-24 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-45646.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43506
[https://github.com/apache/spark/pull/43506]

> Remove hardcoding time variables prior to Hive 2.0
> --
>
> Key: SPARK-45646
> URL: https://issues.apache.org/jira/browse/SPARK-45646
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45650) fix dev/mina get scala 2.12

2023-10-24 Thread tangjiafu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tangjiafu updated SPARK-45650:
--
Description: 
Now, when CI executes ./dev/mina, it generates an incompatibility error with 
Scala 2.12. Sorry, I don't know how to fix it.
[info] [launcher] getting org.scala-sbt sbt 1.9.3  (this may take some time)...
[info] [launcher] getting Scala 2.12.18 (for sbt)...

  was:
[info] [launcher] getting org.scala-sbt sbt 1.9.3  (this may take some time)...
[info] [launcher] getting Scala 2.12.18 (for sbt)...


> fix dev/mina get scala 2.12 
> 
>
> Key: SPARK-45650
> URL: https://issues.apache.org/jira/browse/SPARK-45650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: tangjiafu
>Priority: Major
>
> Now, when CI executes ./dev/mina, it generates an incompatibility error with 
> Scala 2.12. Sorry, I don't know how to fix it.
> [info] [launcher] getting org.scala-sbt sbt 1.9.3  (this may take some 
> time)...
> [info] [launcher] getting Scala 2.12.18 (for sbt)...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45650) fix dev/mina get scala 2.12

2023-10-24 Thread tangjiafu (Jira)
tangjiafu created SPARK-45650:
-

 Summary: fix dev/mina get scala 2.12 
 Key: SPARK-45650
 URL: https://issues.apache.org/jira/browse/SPARK-45650
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: tangjiafu


[info] [launcher] getting org.scala-sbt sbt 1.9.3  (this may take some time)...
[info] [launcher] getting Scala 2.12.18 (for sbt)...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31836) input_file_name() gives wrong value following Python UDF usage

2023-10-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-31836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779048#comment-17779048
 ] 

Rasmus Schøler Sørensen commented on SPARK-31836:
-

We have also encountered this bug. Rather unfortunate that this bug has 
persisted for at least 3.5 years without resolution.

We would like to do what we can to help resolve this issue.

In the meantime, I guess we will mitigate this issue by first loading the raw 
file data into a "raw" table (using `input_file_name()` to populate a column with 
the source file name), then processing the raw table and applying the UDF in a 
second step, outputting to a second table.

For the record, I've included our observations regarding the extent of this bug 
below:
h2. Findings:

The issue occurs whenever a Python UDF is used, both when using `spark.read` 
and when using `spark.readStream`.
We did not observe any cases where the read method would affect whether the bug 
manifested or not (i.e. `spark.read` vs `spark.readStream.text` vs 'cloudFiles' 
stream).
In all cases, the bug only manifested when `input_file_name()` was used in 
conjunction with a UDF.

The issue was observed in the following versions, regardless of whether the UDF 
was placed before or after `input_file_name()`:
 - Spark 3.5.0 (Databricks Runtime 14.1).
 - Spark 3.4.1 (Databricks Runtime 13.3).
 - Spark 3.3.2 (Databricks Runtime 12.2).
 - Spark 3.3.0 (Databricks Runtime 11.3).

For the following versions, we only observed the issue when the UDF column was 
placed *before* `input_file_name()`:
 - Spark 3.2.1 (Databricks Runtime 10.4).
 - Spark 3.1.2 (Databricks Runtime  9.1).

 
h2. Methodology:

We tested four ways of loading data:
 # Using `spark.read`, without a Python UDF.
 # Using `spark.read`, with a Python UDF.
 # Using `spark.readStream`, without a Python UDF.
 # Using `spark.readStream`, with a Python UDF.

The following read methods and formats were tested:
 - Raw text-file read: `spark.read.format('text').load(...)`
 - Text-file stream: `spark.readStream.text(...)`.
 - 'cloudFiles' text stream: 
`spark.readStream.format('cloudFiles').option("cloudFiles.format", 
"text").load(...)`

Input data consisted of a single folder with 2206 text files, each text file 
containing an average of 732 lines, with each line representing a single value 
(in this case, a file path), in total 1615051 rows/lines across all files.

All reads were output to a delta table. The delta-table was subsequently 
analyzed for number of distinct values of the `input_file_name` column.
In cases where the bug manifested, the number of distinct files was typically 
around 70-140 (with the expected/correct number being 2206).

Everything was run inside a "Databricks" environment. Note that Databricks 
sometimes adds some "special sauce" to their version of Spark, although 
generally the "Databricks Spark" is very close to standard Apache Spark.

The cluster used for all tests was a 4-core single-node cluster.

 

> input_file_name() gives wrong value following Python UDF usage
> --
>
> Key: SPARK-31836
> URL: https://issues.apache.org/jira/browse/SPARK-31836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Wesley Hildebrandt
>Priority: Major
>
> I'm using PySpark for Spark 3.0.0 RC1 with Python 3.6.8.
> The following commands demonstrate that the input_file_name() function 
> sometimes returns the wrong filename following usage of a Python UDF:
> {code}
> $ for i in `seq 5`; do echo $i > /tmp/test-file-$i; done
> $ pyspark
> >>> import pyspark.sql.functions as F
> >>> spark.readStream.text('file:///tmp/test-file-*', 
> >>> wholetext=True).withColumn('file1', 
> >>> F.input_file_name()).withColumn('udf', F.udf(lambda 
> >>> x:x)('value')).withColumn('file2', 
> >>> F.input_file_name()).writeStream.trigger(once=True).foreachBatch(lambda 
> >>> df,_: df.select('file1','file2').show(truncate=False, 
> >>> vertical=True)).start().awaitTermination()
> {code}
> A few notes about this bug:
>  * It happens with many different files, so it's not related to the file 
> contents
>  * It also happens loading files from HDFS, so storage location is not a 
> factor
>  * It also happens using .csv() to read the files instead of .text(), so 
> input format is not a factor
>  * I have not been able to cause the error without using readStream, so it 
> seems to be related to streaming
>  * The bug also happens using spark-submit to send a job to my cluster
>  * I haven't tested an older version, but it's possible that Spark pulls 
> 24958 and 25321([https://github.com/apache/spark/pull/24958], 
> [https://github.com/apache/spark/pull/25321]) to fix issue 28153 
> 

[jira] [Updated] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-10-24 Thread Adi Wehrli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adi Wehrli updated SPARK-45644:
---
Description: 
A Spark job ran successfully with Spark 3.2.x and 3.3.x. 

But after upgrading to 3.4.1 (and likewise with 3.5.0), running the same job with 
the same data now always fails with:
{code}
scala.Some is not a valid external type for schema of array
{code}

The corresponding stacktrace is:
{code}
2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
worker for task 0.0 in stage 0.0 (TID 0)"
java.lang.RuntimeException: scala.Some is not a valid external type for schema 
of array
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
 ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
 ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
~[scala-library-2.12.15.jar:?]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
~[scala-library-2.12.15.jar:?]
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
 ~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 ~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.scheduler.Task.run(Task.scala:141) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
 ~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
 ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
 ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
[spark-core_2.12-3.5.0.jar:3.5.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
worker for task 1.0 in stage 0.0 (TID 1)"
java.lang.RuntimeException: scala.Some is not a valid external type for schema 
of array
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
 ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
 ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
~[scala-library-2.12.15.jar:?]
at 

[jira] [Updated] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-10-24 Thread Adi Wehrli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adi Wehrli updated SPARK-45644:
---
Summary: After upgrading to Spark 3.4.1 and 3.5.0 we receive 
RuntimeException "scala.Some is not a valid external type for schema of 
array"  (was: After upgrading to Spark 3.4.1 and 3.5.0 we receive 
RuntimeException "is not valid external type")

> After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException 
> "scala.Some is not a valid external type for schema of array"
> --
>
> Key: SPARK-45644
> URL: https://issues.apache.org/jira/browse/SPARK-45644
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Adi Wehrli
>Priority: Major
>
> A Spark job ran successfully with Spark 3.2.x and 3.3.x. 
> But after upgrading to 3.4.1 (as well as with 3.5.0) the following always 
> occurs now:
> {code}
> scala.Some is not a valid external type for schema of array
> {code}
> The corresponding stacktrace is:
> {code}
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
> worker for task 0.0 in stage 0.0 (TID 0)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:141) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
> [spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
> worker for task 1.0 in stage 0.0 (TID 1)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> 

[jira] [Updated] (SPARK-45649) Unify the prepare framework for `OffsetWindowFunctionFrame`

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45649:
---
Labels: pull-request-available  (was: )

> Unify the prepare framework for `OffsetWindowFunctionFrame`
> ---
>
> Key: SPARK-45649
> URL: https://issues.apache.org/jira/browse/SPARK-45649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the `prepare` implementations of all the 
> `OffsetWindowFunctionFrame` subclasses have the same code logic, shown below.
> ```
>   override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
> if (offset > rows.length) {
>   fillDefaultValue(EmptyRow)
> } else {
>   resetStates(rows)
>   if (ignoreNulls) {
> ...
>   } else {
> ...
>   }
> }
>   }
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45649) Unify the prepare framework for `OffsetWindowFunctionFrame`

2023-10-24 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45649:
---
Summary: Unify the prepare framework for `OffsetWindowFunctionFrame`  (was: 
Unified the prepare framework for `OffsetWindowFunctionFrame`)

> Unify the prepare framework for `OffsetWindowFunctionFrame`
> ---
>
> Key: SPARK-45649
> URL: https://issues.apache.org/jira/browse/SPARK-45649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, the `prepare` implementations of all the 
> `OffsetWindowFunctionFrame` subclasses have the same code logic, shown below.
> ```
>   override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
> if (offset > rows.length) {
>   fillDefaultValue(EmptyRow)
> } else {
>   resetStates(rows)
>   if (ignoreNulls) {
> ...
>   } else {
> ...
>   }
> }
>   }
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45649) Unified the prepare framework for `OffsetWindowFunctionFrame`

2023-10-24 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45649:
--

 Summary: Unified the prepare framework for 
`OffsetWindowFunctionFrame`
 Key: SPARK-45649
 URL: https://issues.apache.org/jira/browse/SPARK-45649
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng


Currently, the `prepare` implementations of all the 
`OffsetWindowFunctionFrame` subclasses have the same code logic, shown below.
```
  override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
if (offset > rows.length) {
  fillDefaultValue(EmptyRow)
} else {
  resetStates(rows)
  if (ignoreNulls) {
...
  } else {
...
  }
}
  }
```
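A sketch of one possible shape for the unification follows. The names and signatures are simplified assumptions rather than the actual Spark classes: the idea is to hoist the shared branching into the base frame and let subclasses supply only the null-handling parts.

{code:scala}
// Template-method sketch; the real OffsetWindowFunctionFrame hierarchy works on
// ExternalAppendOnlyUnsafeRowArray, stubbed out here with Seq for brevity.
abstract class OffsetFrameSketch(offset: Int, ignoreNulls: Boolean) {
  protected def fillDefaultValue(): Unit
  protected def resetStates(rows: Seq[Any]): Unit
  protected def prepareIgnoringNulls(rows: Seq[Any]): Unit
  protected def prepareRespectingNulls(rows: Seq[Any]): Unit

  // Shared logic that is currently duplicated across the concrete frames.
  final def prepare(rows: Seq[Any]): Unit = {
    if (offset > rows.length) {
      fillDefaultValue()
    } else {
      resetStates(rows)
      if (ignoreNulls) prepareIgnoringNulls(rows) else prepareRespectingNulls(rows)
    }
  }
}
{code}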



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45648) Add sql/api and common/utils to modules.py

2023-10-24 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45648:
-

 Summary: Add sql/api and common/utils to modules.py
 Key: SPARK-45648
 URL: https://issues.apache.org/jira/browse/SPARK-45648
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45647) Spark Connect API to propagate per request context

2023-10-24 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-45647:
-

 Summary: Spark Connect API to propagate per request context
 Key: SPARK-45647
 URL: https://issues.apache.org/jira/browse/SPARK-45647
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Juliusz Sompolski


There is an extension point to pass arbitrary proto extensions in the Spark Connect 
UserContext, but there is no API to do this in the client. Add a SparkSession 
API to attach extra protos that will be sent with all requests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45561) Convert TINYINT catalyst properly in MySQL Dialect

2023-10-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-45561.
--
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 43390
[https://github.com/apache/spark/pull/43390]

> Convert TINYINT catalyst properly in MySQL Dialect
> --
>
> Key: SPARK-45561
> URL: https://issues.apache.org/jira/browse/SPARK-45561
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Michael Zhang
>Assignee: Michael Zhang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> The MySQL dialect currently incorrectly converts the Catalyst `TINYINT` type to 
> BYTE. However, MySQL doesn't have a BYTE type.
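For illustration, a hedged sketch (not the actual patch) of how a JDBC dialect can map Catalyst's ByteType to TINYINT on the write path; the dialect name is made up.

{code:scala}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{ByteType, DataType}

// Illustrative custom dialect: MySQL has TINYINT but no BYTE type, so map
// Catalyst ByteType accordingly; everything else falls back to the defaults.
object MySqlTinyIntDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ByteType => Some(JdbcType("TINYINT", Types.TINYINT))
    case _        => None
  }
}
{code}

Such a dialect could be activated with JdbcDialects.registerDialect before reading or writing.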



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45561) Convert TINYINT catalyst properly in MySQL Dialect

2023-10-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-45561:


Assignee: Michael Zhang

> Convert TINYINT catalyst properly in MySQL Dialect
> --
>
> Key: SPARK-45561
> URL: https://issues.apache.org/jira/browse/SPARK-45561
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Michael Zhang
>Assignee: Michael Zhang
>Priority: Minor
>  Labels: pull-request-available
>
> The MySQL dialect currently incorrectly converts the Catalyst `TINYINT` type to 
> BYTE. However, MySQL doesn't have a BYTE type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26052) Spark should output a _SUCCESS file for every partition correctly written

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-26052:
---
Labels: bulk-closed pull-request-available  (was: bulk-closed)

> Spark should output a _SUCCESS file for every partition correctly written
> -
>
> Key: SPARK-26052
> URL: https://issues.apache.org/jira/browse/SPARK-26052
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Matt Matolcsi
>Priority: Minor
>  Labels: bulk-closed, pull-request-available
>
> When writing a set of partitioned Parquet files to HDFS using 
> dataframe.write.parquet(), a _SUCCESS file is written to hdfs://path/to/table 
> after successful completion, though the actual Parquet files will end up in 
> hdfs://path/to/table/partition_key1=val1/partition_key2=val2/ 
> If partitions are written out one at a time (e.g., an hourly ETL), the 
> _SUCCESS file is overwritten by each subsequent run and information on what 
> partitions were correctly written is lost.
> I would like to be able to keep track of what partitions were successfully 
> written in HDFS. I think this could be done by writing the _SUCCESS files to 
> the same partition directories where the Parquet files reside, i.e., 
> hdfs://path/to/table/partition_key1=val1/partition_key2=val2/
> Since https://issues.apache.org/jira/browse/SPARK-13207 has been resolved, I 
> don't think this should break partition discovery.
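For context, a minimal sketch of the write pattern described above; the paths and columns are placeholders.

{code:scala}
import org.apache.spark.sql.SparkSession

object SuccessMarkerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("success-marker").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "2023-10-24", 0), (2, "2023-10-24", 1)).toDF("id", "dt", "hour")

    // Today this writes a single _SUCCESS at the table root, while the Parquet files
    // land under .../dt=.../hour=.../ ; an hourly rerun overwrites that one marker.
    df.write
      .mode("append")
      .partitionBy("dt", "hour")
      .parquet("file:///tmp/path/to/table")

    spark.stop()
  }
}
{code}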



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45575) support time travel options for df read API

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45575:
--

Assignee: Apache Spark

> support time travel options for df read API
> ---
>
> Key: SPARK-45575
> URL: https://issues.apache.org/jira/browse/SPARK-45575
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45575) support time travel options for df read API

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45575:
--

Assignee: (was: Apache Spark)

> support time travel options for df read API
> ---
>
> Key: SPARK-45575
> URL: https://issues.apache.org/jira/browse/SPARK-45575
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45646) Remove hardcoding time variables prior to Hive 2.0

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45646:
--

Assignee: (was: Apache Spark)

> Remove hardcoding time variables prior to Hive 2.0
> --
>
> Key: SPARK-45646
> URL: https://issues.apache.org/jira/browse/SPARK-45646
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45646) Remove hardcoding time variables prior to Hive 2.0

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45646:
--

Assignee: Apache Spark

> Remove hardcoding time variables prior to Hive 2.0
> --
>
> Key: SPARK-45646
> URL: https://issues.apache.org/jira/browse/SPARK-45646
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45575) support time travel options for df read API

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45575:
--

Assignee: Apache Spark

> support time travel options for df read API
> ---
>
> Key: SPARK-45575
> URL: https://issues.apache.org/jira/browse/SPARK-45575
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45575) support time travel options for df read API

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45575:
--

Assignee: (was: Apache Spark)

> support time travel options for df read API
> ---
>
> Key: SPARK-45575
> URL: https://issues.apache.org/jira/browse/SPARK-45575
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45575) support time travel options for df read API

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45575:
--

Assignee: Apache Spark

> support time travel options for df read API
> ---
>
> Key: SPARK-45575
> URL: https://issues.apache.org/jira/browse/SPARK-45575
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45575) support time travel options for df read API

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45575:
--

Assignee: (was: Apache Spark)

> support time travel options for df read API
> ---
>
> Key: SPARK-45575
> URL: https://issues.apache.org/jira/browse/SPARK-45575
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42746) Add the LISTAGG() aggregate function

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-42746:
--

Assignee: Apache Spark

> Add the LISTAGG() aggregate function
> 
>
> Key: SPARK-42746
> URL: https://issues.apache.org/jira/browse/SPARK-42746
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> {{listagg()}} is a common and useful aggregation function to concatenate 
> string values in a column, optionally in a certain order. The systems below 
> already support such a function:
>  * Oracle: 
> [https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030]
>  * Snowflake: [https://docs.snowflake.com/en/sql-reference/functions/listagg]
>  * Amazon Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html]
>  * Google BigQuery: 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg]
> Need to introduce this new aggregate in Spark, both as a regular aggregate 
> and as a window function.
> Proposed syntax:
> {code:sql}
> LISTAGG( [ DISTINCT ] <expr> [, <delimiter> ] ) [ WITHIN GROUP ( 
> <order by clause> ) ]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "is not valid external type"

2023-10-24 Thread Adi Wehrli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adi Wehrli updated SPARK-45644:
---
Description: 
A Spark job ran successfully with Spark 3.2.x and 3.3.x. 

But after upgrading to 3.4.1 (as well as with 3.5.0) the following always 
occurs now:
{code}
scala.Some is not a valid external type for schema of array
{code}

The corresponding stacktrace is:
{code}
2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
worker for task 0.0 in stage 0.0 (TID 0)"
java.lang.RuntimeException: scala.Some is not a valid external type for schema 
of array
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
 ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
 ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
~[scala-library-2.12.15.jar:?]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
~[scala-library-2.12.15.jar:?]
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
 ~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 ~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.scheduler.Task.run(Task.scala:141) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
 ~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
 ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
 ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
[spark-core_2.12-3.5.0.jar:3.5.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
worker for task 1.0 in stage 0.0 (TID 1)"
java.lang.RuntimeException: scala.Some is not a valid external type for schema 
of array
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
 ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
 ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
~[scala-library-2.12.15.jar:?]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 

[jira] [Updated] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "is not valid external type"

2023-10-24 Thread Adi Wehrli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adi Wehrli updated SPARK-45644:
---
Description: 
A Spark job ran successfully with Spark 3.2.x and 3.3.x. 

But after upgrading to 3.4.1 (as well as with 3.5.0) the following always 
occurs now:
{code}
scala.Some is not a valid external type for schema of array
{code}

The corresponding stacktrace is:
{code}
2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch worker for task 0.0 in stage 0.0 (TID 0)"
java.lang.RuntimeException: scala.Some is not a valid external type for schema of array
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) ~[scala-library-2.12.15.jar:?]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169) ~[spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) ~[spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) ~[spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.scheduler.Task.run(Task.scala:141) ~[spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) ~[spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) ~[spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) [spark-core_2.12-3.5.0.jar:3.5.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch worker for task 1.0 in stage 0.0 (TID 1)"
java.lang.RuntimeException: scala.Some is not a valid external type for schema of array
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) ~[scala-library-2.12.15.jar:?]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)

[jira] [Resolved] (SPARK-44752) XML: Update Spark Docs

2023-10-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44752.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43350
[https://github.com/apache/spark/pull/43350]

> XML: Update Spark Docs
> --
>
> Key: SPARK-44752
> URL: https://issues.apache.org/jira/browse/SPARK-44752
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Assignee: tangjiafu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
>  [https://spark.apache.org/docs/latest/sql-data-sources.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44752) XML: Update Spark Docs

2023-10-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44752:


Assignee: tangjiafu

> XML: Update Spark Docs
> --
>
> Key: SPARK-44752
> URL: https://issues.apache.org/jira/browse/SPARK-44752
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Assignee: tangjiafu
>Priority: Major
>  Labels: pull-request-available
>
>  [https://spark.apache.org/docs/latest/sql-data-sources.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45641) Display the start time of an app following the Total Uptime segment on AllJobsPage

2023-10-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45641.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43495
[https://github.com/apache/spark/pull/43495]

> Display the start time of an app following the Total Uptime segment on 
> AllJobsPage
> --
>
> Key: SPARK-45641
> URL: https://issues.apache.org/jira/browse/SPARK-45641
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45641) Display the start time of an app following the Total Uptime segment on AllJobsPage

2023-10-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45641:
-

Assignee: Kent Yao

> Display the start time of an app following the Total Uptime segment on 
> AllJobsPage
> --
>
> Key: SPARK-45641
> URL: https://issues.apache.org/jira/browse/SPARK-45641
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45646) Remove hardcoding time variables prior to Hive 2.0

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45646:
---
Labels: pull-request-available  (was: )

> Remove hardcoding time variables prior to Hive 2.0
> --
>
> Key: SPARK-45646
> URL: https://issues.apache.org/jira/browse/SPARK-45646
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45626) Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE

2023-10-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-45626:


Assignee: BingKun Pan

> Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE
> -
>
> Key: SPARK-45626
> URL: https://issues.apache.org/jira/browse/SPARK-45626
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45626) Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE

2023-10-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-45626.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43479
[https://github.com/apache/spark/pull/43479]

> Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE
> -
>
> Key: SPARK-45626
> URL: https://issues.apache.org/jira/browse/SPARK-45626
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45646) Remove hardcoding time variables prior to Hive 2.0

2023-10-24 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-45646:
-

 Summary: Remove hardcoding time variables prior to Hive 2.0
 Key: SPARK-45646
 URL: https://issues.apache.org/jira/browse/SPARK-45646
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45626) Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE

2023-10-24 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-45626:

Summary: Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE  
(was: Fix variable name of error-class & assign names to the error class 
_LEGACY_ERROR_TEMP_1055)

> Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE
> -
>
> Key: SPARK-45626
> URL: https://issues.apache.org/jira/browse/SPARK-45626
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45630) Replace `s.c.mutable.MapOps#retain` with `s.c.mutable.MapOps#filterInPlace`

2023-10-24 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-45630:


Assignee: Yang Jie

> Replace `s.c.mutable.MapOps#retain` with `s.c.mutable.MapOps#filterInPlace`
> ---
>
> Key: SPARK-45630
> URL: https://issues.apache.org/jira/browse/SPARK-45630
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, YARN
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45430) FramelessOffsetWindowFunctionFrame fails when ignore nulls and offset > # of rows

2023-10-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45430.
-
Fix Version/s: 3.3.4
   3.5.1
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 43236
[https://github.com/apache/spark/pull/43236]

> FramelessOffsetWindowFunctionFrame fails when ignore nulls and offset > # of 
> rows 
> --
>
> Key: SPARK-45430
> URL: https://issues.apache.org/jira/browse/SPARK-45430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vitalii Li
>Assignee: Vitalii Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.4, 3.5.1, 4.0.0, 3.4.2
>
>
> A failure occurs when a function that relies on `FramelessOffsetWindowFunctionFrame` is
> used with `ignoreNulls = true` and `offset > rowCount`, e.g. (a DataFrame-API sketch
> follows below):
> ```
> select x, lead(x, 5) IGNORE NULLS over (order by x) from (select 
> explode(sequence(1, 3)) x)
> ```
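
For reference, a rough DataFrame-API equivalent of the SQL above (a sketch, not from the ticket; assumes a spark-shell session): `lead` with `ignoreNulls = true` and an offset (5) larger than the number of rows (3) should exercise the same offset window frame.
{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Three rows; offset 5 with IGNORE NULLS, so the offset exceeds the row count.
val df = spark.sql("select explode(sequence(1, 3)) as x")
val w = Window.orderBy("x")
df.select(col("x"), lead(col("x"), 5, null, ignoreNulls = true).over(w)).show()
{code}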



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45430) FramelessOffsetWindowFunctionFrame fails when ignore nulls and offset > # of rows

2023-10-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45430:
---

Assignee: Vitalii Li

> FramelessOffsetWindowFunctionFrame fails when ignore nulls and offset > # of 
> rows 
> --
>
> Key: SPARK-45430
> URL: https://issues.apache.org/jira/browse/SPARK-45430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vitalii Li
>Assignee: Vitalii Li
>Priority: Major
>  Labels: pull-request-available
>
> A failure occurs when a function that relies on `FramelessOffsetWindowFunctionFrame` is
> used with `ignoreNulls = true` and `offset > rowCount`, e.g.:
> ```
> select x, lead(x, 5) IGNORE NULLS over (order by x) from (select 
> explode(sequence(1, 3)) x)
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45630) Replace `s.c.mutable.MapOps#retain` with `s.c.mutable.MapOps#filterInPlace`

2023-10-24 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-45630.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43482
[https://github.com/apache/spark/pull/43482]

> Replace `s.c.mutable.MapOps#retain` with `s.c.mutable.MapOps#filterInPlace`
> ---
>
> Key: SPARK-45630
> URL: https://issues.apache.org/jira/browse/SPARK-45630
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, YARN
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
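
For illustration, the mechanical change implied by the title (a sketch, not the actual patch): Scala 2.13 deprecates `mutable.MapOps#retain` in favour of `filterInPlace`, which takes the same `(K, V) => Boolean` predicate and likewise mutates the map in place.
{code:scala}
import scala.collection.mutable

// Illustrative data only.
val activeTasksByExecutor = mutable.Map("exec-1" -> 3, "exec-2" -> 0)

// Before (deprecated in Scala 2.13):
// activeTasksByExecutor.retain((_, tasks) => tasks > 0)

// After: identical semantics, no deprecation warning.
activeTasksByExecutor.filterInPlace((_, tasks) => tasks > 0)
{code}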




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45645) Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated`

2023-10-24 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-45645:
-
Parent: (was: SPARK-45314)
Issue Type: Improvement  (was: Sub-task)

> Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated`
> --
>
> Key: SPARK-45645
> URL: https://issues.apache.org/jira/browse/SPARK-45645
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44032) Remove threeten-extra exclusion in enforceBytecodeVersion rule

2023-10-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44032:
-

Assignee: Dongjoon Hyun

> Remove threeten-extra exclusion in enforceBytecodeVersion rule
> --
>
> Key: SPARK-44032
> URL: https://issues.apache.org/jira/browse/SPARK-44032
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Bowen Liang
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>
> We can remove the `threeten-extra` library exclusion rule because Apache Spark
> 4.0.0's minimum supported Java version is 17.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44032) Remove threeten-extra exclusion in enforceBytecodeVersion rule

2023-10-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44032.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43504
[https://github.com/apache/spark/pull/43504]

> Remove threeten-extra exclusion in enforceBytecodeVersion rule
> --
>
> Key: SPARK-44032
> URL: https://issues.apache.org/jira/browse/SPARK-44032
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Bowen Liang
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We can remove the `threeten-extra` library exclusion rule because Apache Spark
> 4.0.0's minimum supported Java version is 17.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45642) Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated`

2023-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45642:
---
Labels: pull-request-available  (was: )

> Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated`
> --
>
> Key: SPARK-45642
> URL: https://issues.apache.org/jira/browse/SPARK-45642
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>
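
For illustration, the kind of change the title implies (a sketch under assumed usage, not the actual patch): Hadoop deprecates `FileSystem.isFile(Path)` and `FileSystem.isDirectory(Path)` in favour of fetching a `FileStatus` and inspecting it. Note that `getFileStatus` throws `FileNotFoundException` for a missing path, whereas the deprecated helpers simply return false, so call sites may also need an existence check.
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val path = new Path("/tmp/example")  // hypothetical path

// Before (deprecated):
// if (fs.isFile(path)) { ... }
// if (fs.isDirectory(path)) { ... }

// After: fetch the status once, then inspect it.
if (fs.exists(path)) {
  val status = fs.getFileStatus(path)
  if (status.isFile) { /* handle regular file */ }
  if (status.isDirectory) { /* handle directory */ }
}
{code}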




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45645) Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated`

2023-10-24 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan resolved SPARK-45645.
-
Resolution: Duplicate

> Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated`
> --
>
> Key: SPARK-45645
> URL: https://issues.apache.org/jira/browse/SPARK-45645
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45645) Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated`

2023-10-24 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-45645:
---

 Summary: Fix `FileSystem.isFile & FileSystem.isDirectory is 
deprecated`
 Key: SPARK-45645
 URL: https://issues.apache.org/jira/browse/SPARK-45645
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45632) Table cache should avoid unnecessary ColumnarToRow when enable AQE

2023-10-24 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-45632:
-

Assignee: XiDuo You

> Table cache should avoid unnecessary ColumnarToRow when enable AQE
> --
>
> Key: SPARK-45632
> URL: https://issues.apache.org/jira/browse/SPARK-45632
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
>
> If the cache serializer supports columnar input, then we do not need a
> ColumnarToRow node before caching the data. This PR makes that optimization
> apply when AQE is enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45632) Table cache should avoid unnecessary ColumnarToRow when enable AQE

2023-10-24 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-45632.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43484
[https://github.com/apache/spark/pull/43484]

> Table cache should avoid unnecessary ColumnarToRow when enable AQE
> --
>
> Key: SPARK-45632
> URL: https://issues.apache.org/jira/browse/SPARK-45632
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> If the cache serializer supports columnar input, then we do not need a
> ColumnarToRow node before caching the data. This PR makes that optimization
> apply when AQE is enabled (a quick verification sketch follows below).
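
A quick way to observe the effect (a sketch, not from the PR; assumes a spark-shell session and a hypothetical Parquet path): cache a columnar scan with AQE enabled and check whether a ColumnarToRow node still appears in the cached plan.
{code:scala}
// With AQE on, cache a Parquet (columnar) scan and inspect the physical plan of the
// cached query; when the cache serializer accepts columnar input, no ColumnarToRow
// should be needed between the scan and the in-memory relation.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.range(0, 1000).toDF("id").write.mode("overwrite").parquet("/tmp/spark45632")  // hypothetical path

val cached = spark.read.parquet("/tmp/spark45632").cache()
cached.count()                               // materialize the cache
println(cached.queryExecution.executedPlan)  // look for (the absence of) ColumnarToRow
{code}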



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org