[jira] [Created] (SPARK-38326) aditya

2022-02-24 Thread Vallepu Durga Aditya (Jira)
Vallepu Durga Aditya created SPARK-38326:


 Summary: aditya
 Key: SPARK-38326
 URL: https://issues.apache.org/jira/browse/SPARK-38326
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.2.1
Reporter: Vallepu Durga Aditya
 Fix For: 3.2.1






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38316) Fix SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite under ANSI mode

2022-02-24 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-38316.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35652
[https://github.com/apache/spark/pull/35652]

> Fix 
> SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite
>  under ANSI mode
> ---
>
> Key: SPARK-38316
> URL: https://issues.apache.org/jira/browse/SPARK-38316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode

2022-02-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38322.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35658
[https://github.com/apache/spark/pull/35658]

> Support query stage show runtime statistics in formatted explain mode
> -
>
> Key: SPARK-38322
> URL: https://issues.apache.org/jira/browse/SPARK-38322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
> Fix For: 3.3.0
>
>
> The formatted explain mode is a powerful explain mode that shows the details 
> of the query plan. In AQE, a query stage knows its statistics once it has 
> been materialized, so it helps to quickly check plan conversions, e.g. join 
> selection. 
> A simple example:
> {code:java}
> SELECT * FROM t JOIN t2 ON t.c = t2.c;{code}
>  
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (21)
> +- == Final Plan ==
>* SortMergeJoin Inner (13)
>:- * Sort (6)
>:  +- AQEShuffleRead (5)
>: +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1)
>:+- Exchange (3)
>:   +- * Filter (2)
>:  +- Scan hive default.t (1)
>+- * Sort (12)
>   +- AQEShuffleRead (11)
>  +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1)
> +- Exchange (9)
>+- * Filter (8)
>   +- Scan hive default.t2 (7)
> +- == Initial Plan ==
>SortMergeJoin Inner (20)
>:- Sort (16)
>:  +- Exchange (15)
>: +- Filter (14)
>:+- Scan hive default.t (1)
>+- Sort (19)
>   +- Exchange (18)
>  +- Filter (17)
> +- Scan hive default.t2 (7){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode

2022-02-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38322:
---

Assignee: XiDuo You

> Support query stage show runtime statistics in formatted explain mode
> -
>
> Key: SPARK-38322
> URL: https://issues.apache.org/jira/browse/SPARK-38322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.3.0
>
>
> The formatted explain mode is a powerful explain mode that shows the details 
> of the query plan. In AQE, a query stage knows its statistics once it has 
> been materialized, so it helps to quickly check plan conversions, e.g. join 
> selection. 
> A simple example:
> {code:java}
> SELECT * FROM t JOIN t2 ON t.c = t2.c;{code}
>  
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (21)
> +- == Final Plan ==
>* SortMergeJoin Inner (13)
>:- * Sort (6)
>:  +- AQEShuffleRead (5)
>: +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1)
>:+- Exchange (3)
>:   +- * Filter (2)
>:  +- Scan hive default.t (1)
>+- * Sort (12)
>   +- AQEShuffleRead (11)
>  +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1)
> +- Exchange (9)
>+- * Filter (8)
>   +- Scan hive default.t2 (7)
> +- == Initial Plan ==
>SortMergeJoin Inner (20)
>:- Sort (16)
>:  +- Exchange (15)
>: +- Filter (14)
>:+- Scan hive default.t (1)
>+- Sort (19)
>   +- Exchange (18)
>  +- Filter (17)
> +- Scan hive default.t2 (7){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38317) Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"

2022-02-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38317.
--
Resolution: Not A Problem

> Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"
> -
>
> Key: SPARK-38317
> URL: https://issues.apache.org/jira/browse/SPARK-38317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Jolan Rensen
>Priority: Major
>
> {code}
> val dates = Seq(
>     Period.ZERO,
>     Period.ofWeeks(2),
> ).toDS()
> dates.show(false)
> {code}
> Results in:
> {code}
> ++
> |value   |
> ++
> |INTERVAL '0-0' YEAR TO MONTH|
> |INTERVAL '0-0' YEAR TO MONTH|
> ++
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38317) Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"

2022-02-24 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497926#comment-17497926
 ] 

Max Gekk commented on SPARK-38317:
--

This is the expected behavior: Spark truncates java.time.Period to months.
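
For illustration, a minimal sketch of why Period.ofWeeks(2) ends up as INTERVAL '0-0' YEAR TO MONTH. It assumes only what is stated above: the year-month interval encoding keeps the year/month part of the Period and drops the days.

{code:scala}
import java.time.Period

object PeriodToYearMonthExample extends App {
  // Period.ofWeeks(2) is P14D: 0 years, 0 months, 14 days.
  val p = Period.ofWeeks(2)

  // Only the year/month part survives the encoding, so the 14 days are dropped.
  val totalMonths = p.toTotalMonths // 0
  println(s"INTERVAL '${totalMonths / 12}-${totalMonths % 12}' YEAR TO MONTH")
  // prints: INTERVAL '0-0' YEAR TO MONTH
}
{code}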

> Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"
> -
>
> Key: SPARK-38317
> URL: https://issues.apache.org/jira/browse/SPARK-38317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Jolan Rensen
>Priority: Major
>
> {code}
> val dates = Seq(
>     Period.ZERO,
>     Period.ofWeeks(2),
> ).toDS()
> dates.show(false)
> {code}
> Results in:
> {code}
> ++
> |value   |
> ++
> |INTERVAL '0-0' YEAR TO MONTH|
> |INTERVAL '0-0' YEAR TO MONTH|
> ++
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38189) Support priority scheduling (Introduce priorityClass) with volcano implementations

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38189:


Assignee: Apache Spark

> Support priority scheduling (Introduce priorityClass) with volcano 
> implementations
> --
>
> Key: SPARK-38189
> URL: https://issues.apache.org/jira/browse/SPARK-38189
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38189) Support priority scheduling (Introduce priorityClass) with volcano implementations

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497924#comment-17497924
 ] 

Apache Spark commented on SPARK-38189:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35639

> Support priority scheduling (Introduce priorityClass) with volcano 
> implementations
> --
>
> Key: SPARK-38189
> URL: https://issues.apache.org/jira/browse/SPARK-38189
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38189) Support priority scheduling (Introduce priorityClass) with volcano implementations

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38189:


Assignee: (was: Apache Spark)

> Support priority scheduling (Introduce priorityClass) with volcano 
> implementations
> --
>
> Key: SPARK-38189
> URL: https://issues.apache.org/jira/browse/SPARK-38189
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38323) Support the hidden file metadata in Streaming

2022-02-24 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-38323:

Description: 
Currently, querying the hidden file metadata struct `_metadata` fails with the 
`readStream`/`writeStream` APIs.
{code:java}
spark
  .readStream
  ...
  .select("_metadata")
  .writeStream
  ...
  .start(){code}
We need to expose the metadata output to `StreamingRelation` as well.
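
For comparison, here is a hedged sketch of the batch path that already works (the output path is illustrative, and it assumes a Spark build where the batch `_metadata` column is available); the streaming path above is expected to behave the same way once supported.

{code:scala}
import org.apache.spark.sql.SparkSession

object MetadataColumnBatchExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("metadata-column").getOrCreate()
    import spark.implicits._

    // Write a tiny Parquet dataset so the read below has files to scan.
    val path = "/tmp/metadata-column-example" // illustrative path
    Seq(1, 2, 3).toDF("id").write.mode("overwrite").parquet(path)

    // In batch, file sources expose the hidden _metadata struct (file path, size, ...)
    // when it is selected explicitly; streaming should eventually match this.
    spark.read.parquet(path).select("id", "_metadata").show(truncate = false)

    spark.stop()
  }
}
{code}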

> Support the hidden file metadata in Streaming
> -
>
> Key: SPARK-38323
> URL: https://issues.apache.org/jira/browse/SPARK-38323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Yaohua Zhao
>Priority: Major
>
> Currently, querying the hidden file metadata struct `_metadata` fails with 
> the `readStream`/`writeStream` APIs.
> {code:java}
> spark
>   .readStream
>   ...
>   .select("_metadata")
>   .writeStream
>   ...
>   .start(){code}
> We need to expose the metadata output to `StreamingRelation` as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38325) ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt()

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497919#comment-17497919
 ] 

Apache Spark commented on SPARK-38325:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35659

> ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt() 
> 
>
> Key: SPARK-38325
> URL: https://issues.apache.org/jira/browse/SPARK-38325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> SubqueryBroadcastExec retrieves the partition key from the broadcast results 
> based on the type of HashedRelation returned. If the key is packed inside a 
> Long, we extract it through bitwise operations and cast it as Byte/Short/Int 
> if necessary.
> The casting here can cause a potential runtime error. We should fix it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38325) ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt()

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38325:


Assignee: Gengliang Wang  (was: Apache Spark)

> ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt() 
> 
>
> Key: SPARK-38325
> URL: https://issues.apache.org/jira/browse/SPARK-38325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> SubqueryBroadcastExec retrieves the partition key from the broadcast results 
> based on the type of HashedRelation returned. If the key is packed inside a 
> Long, we extract it through bitwise operations and cast it as Byte/Short/Int 
> if necessary.
> The casting here can cause a potential runtime error. We should fix it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38325) ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt()

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38325:


Assignee: Apache Spark  (was: Gengliang Wang)

> ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt() 
> 
>
> Key: SPARK-38325
> URL: https://issues.apache.org/jira/browse/SPARK-38325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> SubqueryBroadcastExec retrieves the partition key from the broadcast results 
> based on the type of HashedRelation returned. If the key is packed inside a 
> Long, we extract it through bitwise operations and cast it as Byte/Short/Int 
> if necessary.
> The casting here can cause a potential runtime error. We should fix it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38325) ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt()

2022-02-24 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-38325:
--

 Summary: ANSI mode: avoid potential runtime error in 
HashJoin.extractKeyExprAt() 
 Key: SPARK-38325
 URL: https://issues.apache.org/jira/browse/SPARK-38325
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0, 3.2.2
Reporter: Gengliang Wang
Assignee: Gengliang Wang


SubqueryBroadcastExec retrieves the partition key from the broadcast results 
based on the type of HashedRelation returned. If the key is packed inside a 
Long, we extract it through bitwise operations and cast it as Byte/Short/Int if 
necessary.

The casting here can cause a potential runtime error. We should fix it.
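
For illustration, a minimal sketch (not the actual Spark code path) of why narrowing a key that was packed into a Long is risky under ANSI mode:

{code:scala}
object PackedKeyCastExample extends App {
  // Hypothetical illustration: a join key packed into the low 32 bits of a Long.
  val packed: Long = 0xFFFFFFFFL // 4294967295, which does not fit in an Int

  // Bitwise extraction of the key bits, as the description mentions.
  val extracted: Long = packed & 0xFFFFFFFFL

  // A plain narrowing conversion silently wraps to -1; under ANSI semantics the
  // same conversion is expected to raise a runtime error instead, which is the
  // failure mode this issue guards against.
  val asInt: Int = extracted.toInt
  println(s"extracted=$extracted, narrowed=$asInt") // extracted=4294967295, narrowed=-1
}
{code}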



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38324) The second range is not [0, 59] in the day time ANSI interval

2022-02-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497905#comment-17497905
 ] 

Hyukjin Kwon commented on SPARK-38324:
--

cc [~Gengliang.Wang] FYI

> The second range is not [0, 59] in the day time ANSI interval
> -
>
> Key: SPARK-38324
> URL: https://issues.apache.org/jira/browse/SPARK-38324
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0
> Environment: Spark 3.3.0 snapshot
>Reporter: chong
>Priority: Major
>
> [https://spark.apache.org/docs/latest/sql-ref-datatypes.html]
>  * SECOND, seconds within minutes and possibly fractions of a second 
> [0..59.99]
> The doc says SECOND is seconds within minutes, so its range should be [0, 59].
>  
> But testing shows that 99 seconds is accepted:
> {{>>> spark.sql("select INTERVAL '10 01:01:99' DAY TO SECOND")}}
> {{DataFrame[INTERVAL '10 01:02:39' DAY TO SECOND: interval day to second]}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38317) Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"

2022-02-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497904#comment-17497904
 ] 

Hyukjin Kwon commented on SPARK-38317:
--

cc [~maxgekk] FYI

> Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"
> -
>
> Key: SPARK-38317
> URL: https://issues.apache.org/jira/browse/SPARK-38317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Jolan Rensen
>Priority: Major
>
> {code}
> val dates = Seq(
>     Period.ZERO,
>     Period.ofWeeks(2),
> ).toDS()
> dates.show(false)
> {code}
> Results in:
> {code}
> ++
> |value   |
> ++
> |INTERVAL '0-0' YEAR TO MONTH|
> |INTERVAL '0-0' YEAR TO MONTH|
> ++
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38317) Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"

2022-02-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38317:
-
Description: 
{code}
val dates = Seq(
    Period.ZERO,
    Period.ofWeeks(2),
).toDS()
dates.show(false)
{code}

Results in:

{code}
++
|value   |
++
|INTERVAL '0-0' YEAR TO MONTH|
|INTERVAL '0-0' YEAR TO MONTH|
++
{code}


  was:
```val dates = Seq(
    Period.ZERO,
    Period.ofWeeks(2),
).toDS()
dates.show(false)```

Results in:
```
++
|value   |
++
|INTERVAL '0-0' YEAR TO MONTH|
|INTERVAL '0-0' YEAR TO MONTH|
++
```


> Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"
> -
>
> Key: SPARK-38317
> URL: https://issues.apache.org/jira/browse/SPARK-38317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Jolan Rensen
>Priority: Major
>
> {code}
> val dates = Seq(
>     Period.ZERO,
>     Period.ofWeeks(2),
> ).toDS()
> dates.show(false)
> {code}
> Results in:
> {code}
> ++
> |value   |
> ++
> |INTERVAL '0-0' YEAR TO MONTH|
> |INTERVAL '0-0' YEAR TO MONTH|
> ++
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38323) Support the hidden file metadata in Streaming

2022-02-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497903#comment-17497903
 ] 

Hyukjin Kwon commented on SPARK-38323:
--

[~yaohua] mind filling in the description?

> Support the hidden file metadata in Streaming
> -
>
> Key: SPARK-38323
> URL: https://issues.apache.org/jira/browse/SPARK-38323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Yaohua Zhao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38317) Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"

2022-02-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38317:
-
Component/s: SQL
 (was: Spark Core)

> Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"
> -
>
> Key: SPARK-38317
> URL: https://issues.apache.org/jira/browse/SPARK-38317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Jolan Rensen
>Priority: Major
>
> ```val dates = Seq(
>     Period.ZERO,
>     Period.ofWeeks(2),
> ).toDS()
> dates.show(false)```
> Results in:
> ```
> ++
> |value   |
> ++
> |INTERVAL '0-0' YEAR TO MONTH|
> |INTERVAL '0-0' YEAR TO MONTH|
> ++
> ```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37614) Support ANSI Aggregate Function: regr_avgx & regr_avgy

2022-02-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37614.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34868
[https://github.com/apache/spark/pull/34868]
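
For reference, a minimal usage sketch of the new functions (the data and the expected output are illustrative; it assumes the ANSI definition where regr_avgy/regr_avgx average the first/second argument over rows where both arguments are non-null, and a Spark version that includes this change):

{code:scala}
import org.apache.spark.sql.SparkSession

object RegrAvgExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("regr-avg").getOrCreate()
    import spark.implicits._

    // (y, x) pairs; the pair with a null x is excluded from both aggregates.
    val rows: Seq[(Double, Option[Double])] =
      Seq((10.0, Some(1.0)), (20.0, Some(3.0)), (30.0, None))
    rows.toDF("y", "x").createOrReplaceTempView("t")

    // Expected: avg_y = 15.0 and avg_x = 2.0, averaged over the two complete pairs.
    spark.sql("SELECT regr_avgy(y, x) AS avg_y, regr_avgx(y, x) AS avg_x FROM t").show()

    spark.stop()
  }
}
{code}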

> Support ANSI Aggregate Function: regr_avgx & regr_avgy
> --
>
> Key: SPARK-37614
> URL: https://issues.apache.org/jira/browse/SPARK-37614
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> REGR_AVGX and REGR_AVGY are ANSI aggregate functions. Many databases support 
> them.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37614) Support ANSI Aggregate Function: regr_avgx & regr_avgy

2022-02-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37614:
---

Assignee: jiaan.geng

> Support ANSI Aggregate Function: regr_avgx & regr_avgy
> --
>
> Key: SPARK-37614
> URL: https://issues.apache.org/jira/browse/SPARK-37614
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> REGR_AVGX and REGR_AVGY are ANSI aggregate functions. Many databases support 
> them.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38322:


Assignee: (was: Apache Spark)

> Support query stage show runtime statistics in formatted explain mode
> -
>
> Key: SPARK-38322
> URL: https://issues.apache.org/jira/browse/SPARK-38322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> The formatted explain mode is a powerful explain mode that shows the details 
> of the query plan. In AQE, a query stage knows its statistics once it has 
> been materialized, so it helps to quickly check plan conversions, e.g. join 
> selection. 
> A simple example:
> {code:java}
> SELECT * FROM t JOIN t2 ON t.c = t2.c;{code}
>  
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (21)
> +- == Final Plan ==
>* SortMergeJoin Inner (13)
>:- * Sort (6)
>:  +- AQEShuffleRead (5)
>: +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1)
>:+- Exchange (3)
>:   +- * Filter (2)
>:  +- Scan hive default.t (1)
>+- * Sort (12)
>   +- AQEShuffleRead (11)
>  +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1)
> +- Exchange (9)
>+- * Filter (8)
>   +- Scan hive default.t2 (7)
> +- == Initial Plan ==
>SortMergeJoin Inner (20)
>:- Sort (16)
>:  +- Exchange (15)
>: +- Filter (14)
>:+- Scan hive default.t (1)
>+- Sort (19)
>   +- Exchange (18)
>  +- Filter (17)
> +- Scan hive default.t2 (7){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497899#comment-17497899
 ] 

Apache Spark commented on SPARK-38322:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/35658

> Support query stage show runtime statistics in formatted explain mode
> -
>
> Key: SPARK-38322
> URL: https://issues.apache.org/jira/browse/SPARK-38322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> The formatted explain mode is a powerful explain mode that shows the details 
> of the query plan. In AQE, a query stage knows its statistics once it has 
> been materialized, so it helps to quickly check plan conversions, e.g. join 
> selection. 
> A simple example:
> {code:java}
> SELECT * FROM t JOIN t2 ON t.c = t2.c;{code}
>  
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (21)
> +- == Final Plan ==
>* SortMergeJoin Inner (13)
>:- * Sort (6)
>:  +- AQEShuffleRead (5)
>: +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1)
>:+- Exchange (3)
>:   +- * Filter (2)
>:  +- Scan hive default.t (1)
>+- * Sort (12)
>   +- AQEShuffleRead (11)
>  +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1)
> +- Exchange (9)
>+- * Filter (8)
>   +- Scan hive default.t2 (7)
> +- == Initial Plan ==
>SortMergeJoin Inner (20)
>:- Sort (16)
>:  +- Exchange (15)
>: +- Filter (14)
>:+- Scan hive default.t (1)
>+- Sort (19)
>   +- Exchange (18)
>  +- Filter (17)
> +- Scan hive default.t2 (7){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38322:


Assignee: Apache Spark

> Support query stage show runtime statistics in formatted explain mode
> -
>
> Key: SPARK-38322
> URL: https://issues.apache.org/jira/browse/SPARK-38322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> The formatted explain mode is a powerful explain mode that shows the details 
> of the query plan. In AQE, a query stage knows its statistics once it has 
> been materialized, so it helps to quickly check plan conversions, e.g. join 
> selection. 
> A simple example:
> {code:java}
> SELECT * FROM t JOIN t2 ON t.c = t2.c;{code}
>  
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (21)
> +- == Final Plan ==
>* SortMergeJoin Inner (13)
>:- * Sort (6)
>:  +- AQEShuffleRead (5)
>: +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1)
>:+- Exchange (3)
>:   +- * Filter (2)
>:  +- Scan hive default.t (1)
>+- * Sort (12)
>   +- AQEShuffleRead (11)
>  +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1)
> +- Exchange (9)
>+- * Filter (8)
>   +- Scan hive default.t2 (7)
> +- == Initial Plan ==
>SortMergeJoin Inner (20)
>:- Sort (16)
>:  +- Exchange (15)
>: +- Filter (14)
>:+- Scan hive default.t (1)
>+- Sort (19)
>   +- Exchange (18)
>  +- Filter (17)
> +- Scan hive default.t2 (7){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38324) The second range is not [0, 59] in the day time ANSI interval

2022-02-24 Thread chong (Jira)
chong created SPARK-38324:
-

 Summary: The second range is not [0, 59] in the day time ANSI 
interval
 Key: SPARK-38324
 URL: https://issues.apache.org/jira/browse/SPARK-38324
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 3.3.0
 Environment: Spark 3.3.0 snapshot
Reporter: chong


[https://spark.apache.org/docs/latest/sql-ref-datatypes.html]
 * SECOND, seconds within minutes and possibly fractions of a second 
[0..59.99]

The doc says SECOND is seconds within minutes, so its range should be [0, 59].

But testing shows that 99 seconds is accepted:

{{>>> spark.sql("select INTERVAL '10 01:01:99' DAY TO SECOND")}}
{{DataFrame[INTERVAL '10 01:02:39' DAY TO SECOND: interval day to second]}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38323) Support the hidden file metadata in Streaming

2022-02-24 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-38323:
---

 Summary: Support the hidden file metadata in Streaming
 Key: SPARK-38323
 URL: https://issues.apache.org/jira/browse/SPARK-38323
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Structured Streaming
Affects Versions: 3.2.1
Reporter: Yaohua Zhao






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38298) Fix DataExpressionSuite, NullExpressionsSuite, StringExpressionsSuite, complexTypesSuite, CastSuite under ANSI mode

2022-02-24 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-38298.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35618
[https://github.com/apache/spark/pull/35618]

> Fix DataExpressionSuite, NullExpressionsSuite, StringExpressionsSuite, 
> complexTypesSuite, CastSuite under ANSI mode
> ---
>
> Key: SPARK-38298
> URL: https://issues.apache.org/jira/browse/SPARK-38298
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38298) Fix DataExpressionSuite, NullExpressionsSuite, StringExpressionsSuite, complexTypesSuite, CastSuite under ANSI mode

2022-02-24 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-38298:
--

Assignee: Xinyi Yu

> Fix DataExpressionSuite, NullExpressionsSuite, StringExpressionsSuite, 
> complexTypesSuite, CastSuite under ANSI mode
> ---
>
> Key: SPARK-38298
> URL: https://issues.apache.org/jira/browse/SPARK-38298
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode

2022-02-24 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-38322:
--
Description: 
The formatted explain mode is a powerful explain mode that shows the details of 
the query plan. In AQE, a query stage knows its statistics once it has been 
materialized, so it helps to quickly check plan conversions, e.g. join 
selection. 

A simple example:
{code:java}
SELECT * FROM t JOIN t2 ON t.c = t2.c;{code}
 
{code:java}
== Physical Plan ==
AdaptiveSparkPlan (21)
+- == Final Plan ==
   * SortMergeJoin Inner (13)
   :- * Sort (6)
   :  +- AQEShuffleRead (5)
   : +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1)
   :+- Exchange (3)
   :   +- * Filter (2)
   :  +- Scan hive default.t (1)
   +- * Sort (12)
  +- AQEShuffleRead (11)
 +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Exchange (9)
   +- * Filter (8)
  +- Scan hive default.t2 (7)
+- == Initial Plan ==
   SortMergeJoin Inner (20)
   :- Sort (16)
   :  +- Exchange (15)
   : +- Filter (14)
   :+- Scan hive default.t (1)
   +- Sort (19)
  +- Exchange (18)
 +- Filter (17)
+- Scan hive default.t2 (7){code}
 

 

  was:
The formatted explain mode is a powerful explain mode that shows the details of 
the query plan. In AQE, a query stage knows its statistics once it has been 
materialized, so it helps to quickly check plan conversions, e.g. join 
selection. 

 

A simple example:
{code:java}
SELECT * FROM t JOIN t2 ON t.c = t2.c;{code}
 

 
{code:java}
== Physical Plan ==
AdaptiveSparkPlan (21)
+- == Final Plan ==
   * SortMergeJoin Inner (13)
   :- * Sort (6)
   :  +- AQEShuffleRead (5)
   : +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1)
   :+- Exchange (3)
   :   +- * Filter (2)
   :  +- Scan hive default.t (1)
   +- * Sort (12)
  +- AQEShuffleRead (11)
 +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Exchange (9)
   +- * Filter (8)
  +- Scan hive default.t2 (7)
+- == Initial Plan ==
   SortMergeJoin Inner (20)
   :- Sort (16)
   :  +- Exchange (15)
   : +- Filter (14)
   :+- Scan hive default.t (1)
   +- Sort (19)
  +- Exchange (18)
 +- Filter (17)
+- Scan hive default.t2 (7){code}
 

 


> Support query stage show runtime statistics in formatted explain mode
> -
>
> Key: SPARK-38322
> URL: https://issues.apache.org/jira/browse/SPARK-38322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> The formatted explain mode is a powerful explain mode that shows the details 
> of the query plan. In AQE, a query stage knows its statistics once it has 
> been materialized, so it helps to quickly check plan conversions, e.g. join 
> selection. 
> A simple example:
> {code:java}
> SELECT * FROM t JOIN t2 ON t.c = t2.c;{code}
>  
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (21)
> +- == Final Plan ==
>* SortMergeJoin Inner (13)
>:- * Sort (6)
>:  +- AQEShuffleRead (5)
>: +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1)
>:+- Exchange (3)
>:   +- * Filter (2)
>:  +- Scan hive default.t (1)
>+- * Sort (12)
>   +- AQEShuffleRead (11)
>  +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1)
> +- Exchange (9)
>+- * Filter (8)
>   +- Scan hive default.t2 (7)
> +- == Initial Plan ==
>SortMergeJoin Inner (20)
>:- Sort (16)
>:  +- Exchange (15)
>: +- Filter (14)
>:+- Scan hive default.t (1)
>+- Sort (19)
>   +- Exchange (18)
>  +- Filter (17)
> +- Scan hive default.t2 (7){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode

2022-02-24 Thread XiDuo You (Jira)
XiDuo You created SPARK-38322:
-

 Summary: Support query stage show runtime statistics in formatted 
explain mode
 Key: SPARK-38322
 URL: https://issues.apache.org/jira/browse/SPARK-38322
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


The formatted explain mode is a powerful explain mode that shows the details of 
the query plan. In AQE, a query stage knows its statistics once it has been 
materialized, so it helps to quickly check plan conversions, e.g. join 
selection. 

 

A simple example:
{code:java}
SELECT * FROM t JOIN t2 ON t.c = t2.c;{code}
 

 
{code:java}
== Physical Plan ==
AdaptiveSparkPlan (21)
+- == Final Plan ==
   * SortMergeJoin Inner (13)
   :- * Sort (6)
   :  +- AQEShuffleRead (5)
   : +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1)
   :+- Exchange (3)
   :   +- * Filter (2)
   :  +- Scan hive default.t (1)
   +- * Sort (12)
  +- AQEShuffleRead (11)
 +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Exchange (9)
   +- * Filter (8)
  +- Scan hive default.t2 (7)
+- == Initial Plan ==
   SortMergeJoin Inner (20)
   :- Sort (16)
   :  +- Exchange (15)
   : +- Filter (14)
   :+- Scan hive default.t (1)
   +- Sort (19)
  +- Exchange (18)
 +- Filter (17)
+- Scan hive default.t2 (7){code}
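
For context, a minimal sketch of how to produce this kind of output (the tiny temp views below stand in for the t and t2 tables above; whether runtime Statistics(...) appear depends on the query stages having materialized, which is what this issue is about):

{code:scala}
import org.apache.spark.sql.SparkSession

object FormattedExplainExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("formatted-explain")
      .config("spark.sql.adaptive.enabled", "true") // AQE, so query stages exist
      .getOrCreate()
    import spark.implicits._

    // Tiny stand-ins for the t and t2 tables used in the example above.
    Seq(1, 2, 3).toDF("c").createOrReplaceTempView("t")
    Seq(2, 3, 4).toDF("c").createOrReplaceTempView("t2")

    // "formatted" is the explain mode this issue extends; with AQE, materialized
    // ShuffleQueryStage nodes are where the runtime statistics would show up.
    spark.sql("SELECT * FROM t JOIN t2 ON t.c = t2.c").explain("formatted")

    spark.stop()
  }
}
{code}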
 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38311) Fix DynamicPartitionPruning/BucketedReadSuite/ExpressionInfoSuite under ANSI mode

2022-02-24 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-38311.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35644
[https://github.com/apache/spark/pull/35644]

> Fix DynamicPartitionPruning/BucketedReadSuite/ExpressionInfoSuite under ANSI 
> mode
> -
>
> Key: SPARK-38311
> URL: https://issues.apache.org/jira/browse/SPARK-38311
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38275) Consider to include WriteBatch's memory in the memory usage of RocksDB state store

2022-02-24 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-38275.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35600
[https://github.com/apache/spark/pull/35600]

> Consider to include WriteBatch's memory in the memory usage of RocksDB state 
> store
> --
>
> Key: SPARK-38275
> URL: https://issues.apache.org/jira/browse/SPARK-38275
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Major
> Fix For: 3.3.0
>
>
> The current RocksDB state store uses an unbounded {{WriteBatch}} together with 
> the DB, and the {{WriteBatch}} is not cleared until the micro-batch data is 
> committed, so the memory usage of the {{WriteBatch}} can become very large.
> We should consider adding the approximate memory usage of the WriteBatch to 
> the total memory usage and also reporting it separately.
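
For illustration only, a hedged sketch of the proposed accounting (the names below are made up for this example and are not the actual Spark or RocksDB APIs):

{code:scala}
// Hypothetical metrics holder: report the WriteBatch's approximate size on its
// own and also fold it into the total, as the description proposes.
final case class RocksDbStateStoreMemory(dbMemUsageBytes: Long, writeBatchMemUsageBytes: Long) {
  def totalBytes: Long = dbMemUsageBytes + writeBatchMemUsageBytes

  override def toString: String =
    s"total=$totalBytes bytes (db=$dbMemUsageBytes, writeBatch=$writeBatchMemUsageBytes)"
}

object RocksDbStateStoreMemoryExample extends App {
  // Illustrative numbers only.
  println(RocksDbStateStoreMemory(dbMemUsageBytes = 512L * 1024 * 1024,
    writeBatchMemUsageBytes = 128L * 1024 * 1024))
}
{code}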



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38275) Consider to include WriteBatch's memory in the memory usage of RocksDB state store

2022-02-24 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-38275:


Assignee: Yun Tang

> Consider to include WriteBatch's memory in the memory usage of RocksDB state 
> store
> --
>
> Key: SPARK-38275
> URL: https://issues.apache.org/jira/browse/SPARK-38275
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Major
>
> The current RocksDB state store uses an unbounded {{WriteBatch}} together with 
> the DB, and the {{WriteBatch}} is not cleared until the micro-batch data is 
> committed, so the memory usage of the {{WriteBatch}} can become very large.
> We should consider adding the approximate memory usage of the WriteBatch to 
> the total memory usage and also reporting it separately.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38172) Adaptive coalesce not working with df persist

2022-02-24 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497870#comment-17497870
 ] 

XiDuo You commented on SPARK-38172:
---

Thanks [~Naveenmts] for confirming!

> Adaptive coalesce not working with df persist
> -
>
> Key: SPARK-38172
> URL: https://issues.apache.org/jira/browse/SPARK-38172
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
> Environment: OS: Linux
> Spark Version: 3.2.3
>Reporter: Naveen Nagaraj
>Priority: Major
> Attachments: image-2022-02-10-15-32-30-355.png, 
> image-2022-02-10-15-33-08-018.png, image-2022-02-10-15-33-32-607.png
>
>
> {code:java}
> // code placeholder
> val spark = SparkSession.builder().master("local[4]").appName("Test")
>                         .config("spark.sql.adaptive.enabled", "true")
>                         
> .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
>                         
> .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "50m")
>                         
> .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
>                         
> .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1024")
>                         .getOrCreate()
> val df = spark.read.csv("")
> val df1 = df.distinct()
> df1.persist() // On removing this line. Code works as expected
> df1.write.csv("") {code}
> Without df1.persist, df1.write.csv writes 4 partition files of 50 MB each, 
> which is expected:
> [https://i.stack.imgur.com/tDxpV.png]
> If I include df1.persist, Spark writes 200 partitions (adaptive coalesce is 
> not applied). With persist:
> [https://i.stack.imgur.com/W13hA.png]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38172) Adaptive coalesce not working with df persist

2022-02-24 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-38172.
---
Resolution: Won't Fix

> Adaptive coalesce not working with df persist
> -
>
> Key: SPARK-38172
> URL: https://issues.apache.org/jira/browse/SPARK-38172
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
> Environment: OS: Linux
> Spark Version: 3.2.3
>Reporter: Naveen Nagaraj
>Priority: Major
> Attachments: image-2022-02-10-15-32-30-355.png, 
> image-2022-02-10-15-33-08-018.png, image-2022-02-10-15-33-32-607.png
>
>
> {code:java}
> // code placeholder
> val spark = SparkSession.builder().master("local[4]").appName("Test")
>                         .config("spark.sql.adaptive.enabled", "true")
>                         
> .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
>                         
> .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "50m")
>                         
> .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
>                         
> .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1024")
>                         .getOrCreate()
> val df = spark.read.csv("")
> val df1 = df.distinct()
> df1.persist() // On removing this line. Code works as expected
> df1.write.csv("") {code}
> Without df1.persist, df1.write.csv writes 4 partition files of 50 MB each, 
> which is expected:
> [https://i.stack.imgur.com/tDxpV.png]
> If I include df1.persist, Spark writes 200 partitions (adaptive coalesce is 
> not applied). With persist:
> [https://i.stack.imgur.com/W13hA.png]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38303) Upgrade ansi-regex from 5.0.0 to 5.0.1 in /dev

2022-02-24 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-38303.

Fix Version/s: 3.3.0
   3.2.2
 Assignee: Bjørn Jørgensen
   Resolution: Fixed

Issue resolved in https://github.com/apache/spark/pull/35628

> Upgrade ansi-regex from 5.0.0 to 5.0.1 in /dev
> --
>
> Key: SPARK-38303
> URL: https://issues.apache.org/jira/browse/SPARK-38303
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> [CVE-2021-3807|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3807]
>   
> [release notes on GitHub|https://github.com/chalk/ansi-regex/releases]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38303) Upgrade ansi-regex from 5.0.0 to 5.0.1 in /dev

2022-02-24 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-38303:
---
Affects Version/s: 3.2.1

> Upgrade ansi-regex from 5.0.0 to 5.0.1 in /dev
> --
>
> Key: SPARK-38303
> URL: https://issues.apache.org/jira/browse/SPARK-38303
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [CVE-2021-3807|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3807]
>   
> [release notes on GitHub|https://github.com/chalk/ansi-regex/releases]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38302) Use Java 17 in K8S integration tests when setting spark-tgz

2022-02-24 Thread qian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497833#comment-17497833
 ] 

qian commented on SPARK-38302:
--

[~dongjoon] Thanks for your work :)

> Use Java 17 in K8S integration tests when setting spark-tgz
> ---
>
> Key: SPARK-38302
> URL: https://issues.apache.org/jira/browse/SPARK-38302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: qian
>Assignee: qian
>Priority: Minor
>
> When the `spark-tgz` parameter is set during integration tests, an error occurs 
> because 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`
>  cannot be found. This is because the default value of 
> `spark.kubernetes.test.dockerFile` is 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`.
>  When using the tgz, the working directory is 
> `${spark.kubernetes.test.unpackSparkDir}`, so the relative path 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`
>  is invalid.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38191) The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

2022-02-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38191.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35492
[https://github.com/apache/spark/pull/35492]

> The staging directory of write job only needs to be initialized once in 
> HadoopMapReduceCommitProtocol.
> --
>
> Key: SPARK-38191
> URL: https://issues.apache.org/jira/browse/SPARK-38191
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1
>Reporter: weixiuli
>Assignee: weixiuli
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38191) The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

2022-02-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-38191:


Assignee: weixiuli

> The staging directory of write job only needs to be initialized once in 
> HadoopMapReduceCommitProtocol.
> --
>
> Key: SPARK-38191
> URL: https://issues.apache.org/jira/browse/SPARK-38191
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1
>Reporter: weixiuli
>Assignee: weixiuli
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow

2022-02-24 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497826#comment-17497826
 ] 

L. C. Hsieh commented on SPARK-38285:
-

Thanks for reporting this. I will take a look.

> ClassCastException: GenericArrayData cannot be cast to InternalRow
> --
>
> Key: SPARK-38285
> URL: https://issues.apache.org/jira/browse/SPARK-38285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Alessandro Bacchini
>Priority: Major
>
> The following code with Spark 3.2.1 raises an exception:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([
>     StructField('o', 
>         ArrayType(
>             StructType([
>                 StructField('s', StringType(), False),
>                 StructField('b', ArrayType(
>                     StructType([
>                         StructField('e', StringType(), False)
>                     ]),
>                     True),
>                 False)
>             ]), 
>         True),
>     False)])
> value = {
>     "o": [
>         {
>             "s": "string1",
>             "b": [
>                 {
>                     "e": "string2"
>                 },
>                 {
>                     "e": "string3"
>                 }
>             ]
>         },
>         {
>             "s": "string4",
>             "b": [
>                 {
>                     "e": "string5"
>                 },
>                 {
>                     "e": "string6"
>                 },
>                 {
>                     "e": "string7"
>                 }
>             ]
>         }
>     ]
> }
> df = (
>     spark.createDataFrame([value], schema=t)
>     .select(F.explode("o").alias("eo"))
>     .select("eo.b.e")
> )
> df.show()
> {code}
> The exception message is:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
> org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
>   at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.Task.run(Task.scala:93)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> I am using Spark 3.2.1, but I don't know whether Spark 3.3.0 is also affected.
> Please note that the issue seems to be related to SPARK-37577: I am using the 
> same DataFrame schema, but this time I have populated it with non-empty values.
> I think this is a bug because with the following configuration it works as 
> expected:
> {code:python}
> spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
> 

[jira] [Assigned] (SPARK-37377) Refactor V2 Partitioning interface and remove deprecated usage of Distribution

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37377:


Assignee: (was: Apache Spark)

> Refactor V2 Partitioning interface and remove deprecated usage of Distribution
> --
>
> Key: SPARK-37377
> URL: https://issues.apache.org/jira/browse/SPARK-37377
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently {{Partitioning}} is defined as follows:
> {code:scala}
> @Evolving
> public interface Partitioning {
>   int numPartitions();
>   boolean satisfy(Distribution distribution);
> }
> {code}
> There are two issues with the interface: 1) it uses the deprecated 
> {{Distribution}} interface, and should switch to 
> {{org.apache.spark.sql.connector.distributions.Distribution}}; 2) currently 
> there is no way to use this in joins, where we want to compare the reported 
> partitionings from both sides and decide whether they are "compatible" (and 
> thus allow Spark to eliminate the shuffle).
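For illustration only, a minimal sketch of what the refactored interface could look like once it accepts the non-deprecated connector {{Distribution}}; the shape of the API and the {{satisfies}} method name below are assumptions, not the change actually made in the linked PR:
{code:scala}
// Sketch under the assumption that a satisfy-style check is kept but takes
// org.apache.spark.sql.connector.distributions.Distribution instead of the
// deprecated Distribution from the read.partitioning package.
import org.apache.spark.sql.connector.distributions.Distribution

trait Partitioning {
  // Number of partitions the data source reports for its output.
  def numPartitions(): Int

  // Hypothetical replacement for satisfy(Distribution) on the old interface.
  def satisfies(distribution: Distribution): Boolean
}
{code}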



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37377) Refactor V2 Partitioning interface and remove deprecated usage of Distribution

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37377:


Assignee: Apache Spark

> Refactor V2 Partitioning interface and remove deprecated usage of Distribution
> --
>
> Key: SPARK-37377
> URL: https://issues.apache.org/jira/browse/SPARK-37377
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Currently {{Partitioning}} is defined as follows:
> {code:scala}
> @Evolving
> public interface Partitioning {
>   int numPartitions();
>   boolean satisfy(Distribution distribution);
> }
> {code}
> There are two issues with the interface: 1) it uses the deprecated 
> {{Distribution}} interface, and should switch to 
> {{org.apache.spark.sql.connector.distributions.Distribution}}; 2) currently 
> there is no way to use this in joins, where we want to compare the reported 
> partitionings from both sides and decide whether they are "compatible" (and 
> thus allow Spark to eliminate the shuffle).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37377) Refactor V2 Partitioning interface and remove deprecated usage of Distribution

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497819#comment-17497819
 ] 

Apache Spark commented on SPARK-37377:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/35657

> Refactor V2 Partitioning interface and remove deprecated usage of Distribution
> --
>
> Key: SPARK-37377
> URL: https://issues.apache.org/jira/browse/SPARK-37377
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently {{Partitioning}} is defined as follows:
> {code:scala}
> @Evolving
> public interface Partitioning {
>   int numPartitions();
>   boolean satisfy(Distribution distribution);
> }
> {code}
> There are two issues with the interface: 1) it uses the deprecated 
> {{Distribution}} interface, and should switch to 
> {{org.apache.spark.sql.connector.distributions.Distribution}}; 2) currently 
> there is no way to use this in joins, where we want to compare the reported 
> partitionings from both sides and decide whether they are "compatible" (and 
> thus allow Spark to eliminate the shuffle).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38107) Use error classes in the compilation errors of python/pandas UDFs

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38107:


Assignee: (was: Apache Spark)

> Use error classes in the compilation errors of python/pandas UDFs
> -
>
> Key: SPARK-38107
> URL: https://issues.apache.org/jira/browse/SPARK-38107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * pandasUDFAggregateNotSupportedInPivotError
> * groupAggPandasUDFUnsupportedByStreamingAggError
> * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError
> * usePythonUDFInJoinConditionUnsupportedError
> so that they use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.
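As an illustration of the migration pattern, here is a minimal sketch of one of the listed errors rewritten to carry an error class; the class name {{UNSUPPORTED_FEATURE}}, the message parameter, and the exact constructor arguments are assumptions rather than the wording used in the eventual PR:
{code:scala}
// Sketch only: assumes AnalysisException's errorClass/messageParameters constructor
// and an existing UNSUPPORTED_FEATURE entry in error-classes.json.
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.plans.JoinType

def usePythonUDFInJoinConditionUnsupportedError(joinType: JoinType): Throwable = {
  new AnalysisException(
    errorClass = "UNSUPPORTED_FEATURE",
    messageParameters = Array(s"Python UDF in the ON clause of a $joinType join"))
}
{code}
A matching test in QueryCompilationErrorsSuite would then assert on the error class of the thrown exception rather than on the raw message text.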



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38107) Use error classes in the compilation errors of python/pandas UDFs

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38107:


Assignee: Apache Spark

> Use error classes in the compilation errors of python/pandas UDFs
> -
>
> Key: SPARK-38107
> URL: https://issues.apache.org/jira/browse/SPARK-38107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * pandasUDFAggregateNotSupportedInPivotError
> * groupAggPandasUDFUnsupportedByStreamingAggError
> * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError
> * usePythonUDFInJoinConditionUnsupportedError
> so that they use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38107) Use error classes in the compilation errors of python/pandas UDFs

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497799#comment-17497799
 ] 

Apache Spark commented on SPARK-38107:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/35656

> Use error classes in the compilation errors of python/pandas UDFs
> -
>
> Key: SPARK-38107
> URL: https://issues.apache.org/jira/browse/SPARK-38107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * pandasUDFAggregateNotSupportedInPivotError
> * groupAggPandasUDFUnsupportedByStreamingAggError
> * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError
> * usePythonUDFInJoinConditionUnsupportedError
> so that they use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38315) Add a config to control decoding of datetime as Java 8 classes

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497740#comment-17497740
 ] 

Apache Spark commented on SPARK-38315:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/35655

> Add a config to control decoding of datetime as Java 8 classes
> --
>
> Key: SPARK-38315
> URL: https://issues.apache.org/jira/browse/SPARK-38315
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Add a new config that should control collect(), in particular, and allow 
> enabling/disabling Java 8 types in the Thrift server. The config should solve 
> the following issue:
> When a user connects to the Thrift Server and a query involves a datasource 
> connector which doesn't handle Java 8 types, the user observes the following 
> exception:
> {code:java}
> ERROR SparkExecuteStatementOperation: Error executing query with 
> ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING,  
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while 
> encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid 
> external type for schema of timestamp  
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, 
> TimestampType, instantToMicros, 
> validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, 
> false) AS loan_perf_date#1125  
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239)
>   
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210)
>   
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)  
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   
> {code}
>  
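For context, the existing {{spark.sql.datetime.java8API.enabled}} flag already switches the external types returned by {{collect()}}; a minimal sketch of that behavior is below (the name of the new config proposed by this ticket is intentionally not guessed here):
{code:scala}
// Sketch only: with the Java 8 datetime API flag on, timestamps collect as
// java.time.Instant; with it off, they collect as java.sql.Timestamp.
spark.conf.set("spark.sql.datetime.java8API.enabled", "true")
val ts = spark.sql("SELECT timestamp'2022-02-24 00:00:00' AS ts").collect().head.get(0)
assert(ts.isInstanceOf[java.time.Instant])
{code}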



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38315) Add a config to control decoding of datetime as Java 8 classes

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497738#comment-17497738
 ] 

Apache Spark commented on SPARK-38315:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/35655

> Add a config to control decoding of datetime as Java 8 classes
> --
>
> Key: SPARK-38315
> URL: https://issues.apache.org/jira/browse/SPARK-38315
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Add a new config that should control collect(), in particular, and allow 
> enabling/disabling Java 8 types in the Thrift server. The config should solve 
> the following issue:
> When a user connects to the Thrift Server and a query involves a datasource 
> connector which doesn't handle Java 8 types, the user observes the following 
> exception:
> {code:java}
> ERROR SparkExecuteStatementOperation: Error executing query with 
> ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING,  
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while 
> encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid 
> external type for schema of timestamp  
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, 
> TimestampType, instantToMicros, 
> validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, 
> false) AS loan_perf_date#1125  
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239)
>   
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210)
>   
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)  
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38315) Add a config to control decoding of datetime as Java 8 classes

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38315:


Assignee: Max Gekk  (was: Apache Spark)

> Add a config to control decoding of datetime as Java 8 classes
> --
>
> Key: SPARK-38315
> URL: https://issues.apache.org/jira/browse/SPARK-38315
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Add a new config that should control collect(), in particular, and allow 
> enabling/disabling Java 8 types in the Thrift server. The config should solve 
> the following issue:
> When a user connects to the Thrift Server and a query involves a datasource 
> connector which doesn't handle Java 8 types, the user observes the following 
> exception:
> {code:java}
> ERROR SparkExecuteStatementOperation: Error executing query with 
> ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING,  
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while 
> encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid 
> external type for schema of timestamp  
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, 
> TimestampType, instantToMicros, 
> validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, 
> false) AS loan_perf_date#1125  
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239)
>   
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210)
>   
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)  
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38315) Add a config to control decoding of datetime as Java 8 classes

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38315:


Assignee: Apache Spark  (was: Max Gekk)

> Add a config to control decoding of datetime as Java 8 classes
> --
>
> Key: SPARK-38315
> URL: https://issues.apache.org/jira/browse/SPARK-38315
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Add a new config that should control collect(), in particular, and allow 
> enabling/disabling Java 8 types in the Thrift server. The config should solve 
> the following issue:
> When a user connects to the Thrift Server and a query involves a datasource 
> connector which doesn't handle Java 8 types, the user observes the following 
> exception:
> {code:java}
> ERROR SparkExecuteStatementOperation: Error executing query with 
> ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING,  
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while 
> encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid 
> external type for schema of timestamp  
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, 
> TimestampType, instantToMicros, 
> validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, 
> false) AS loan_perf_date#1125  
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239)
>   
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210)
>   
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)  
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38315) Add a config to control decoding of datetime as Java 8 classes

2022-02-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-38315:
-
Summary: Add a config to control decoding of datetime as Java 8 classes  
(was: Add a config to collect objects as Java 8 types in the Thrift server)

> Add a config to control decoding of datetime as Java 8 classes
> --
>
> Key: SPARK-38315
> URL: https://issues.apache.org/jira/browse/SPARK-38315
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Add a new config that should control collect() and allow enabling/disabling 
> Java 8 types in the Thrift server. The config should solve the following 
> issue:
> When a user connects to the Thrift Server and a query involves a datasource 
> connector which doesn't handle Java 8 types, the user observes the following 
> exception:
> {code:java}
> ERROR SparkExecuteStatementOperation: Error executing query with 
> ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING,  
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while 
> encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid 
> external type for schema of timestamp  
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, 
> TimestampType, instantToMicros, 
> validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, 
> false) AS loan_perf_date#1125  
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239)
>   
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210)
>   
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)  
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38315) Add a config to control decoding of datetime as Java 8 classes

2022-02-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-38315:
-
Description: 
Add a new config that should control collect(), in particular, and allow 
enabling/disabling Java 8 types in the Thrift server. The config should solve 
the following issue:

When a user connects to the Thrift Server and a query involves a datasource 
connector which doesn't handle Java 8 types, the user observes the following 
exception:

{code:java}
ERROR SparkExecuteStatementOperation: Error executing query with 
ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING,  
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while encoding: 
java.lang.RuntimeException: java.sql.Timestamp is not a valid external type for 
schema of timestamp  
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else 
staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, 
TimestampType, instantToMicros, 
validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, 
false) AS loan_perf_date#1125  
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239)
  
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210)
  
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)  
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)  
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  
{code}

 

  was:
Add a new config that should control collect() and allow enabling/disabling 
Java 8 types in the Thrift server. The config should solve the following issue:

When a user connects to the Thrift Server and a query involves a datasource 
connector which doesn't handle Java 8 types, the user observes the following 
exception:

{code:java}
ERROR SparkExecuteStatementOperation: Error executing query with 
ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING,  
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while encoding: 
java.lang.RuntimeException: java.sql.Timestamp is not a valid external type for 
schema of timestamp  
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else 
staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, 
TimestampType, instantToMicros, 
validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, 
false) AS loan_perf_date#1125  
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239)
  
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210)
  
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)  
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)  
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  
{code}

 


> Add a config to control decoding of datetime as Java 8 classes
> --
>
> Key: SPARK-38315
> URL: https://issues.apache.org/jira/browse/SPARK-38315
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Add a new config that should control collect(), in particular, and allow 
> enabling/disabling Java 8 types in the Thrift server. The config should solve 
> the following issue:
> When a user connects to the Thrift Server and a query involves a datasource 
> connector which doesn't handle Java 8 types, the user observes the following 
> exception:
> {code:java}
> ERROR SparkExecuteStatementOperation: Error executing query with 
> ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING,  
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while 
> encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid 
> external type for schema of timestamp  
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else 

[jira] [Assigned] (SPARK-38302) Use Java 17 in K8S integration tests when setting spark-tgz

2022-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38302:
-

Assignee: qian

> Use Java 17 in K8S integration tests when setting spark-tgz
> ---
>
> Key: SPARK-38302
> URL: https://issues.apache.org/jira/browse/SPARK-38302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: qian
>Assignee: qian
>Priority: Minor
>
> When setting the parameter `spark-tgz` during integration tests, an error occurs 
> saying that 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`
> cannot be found. This is due to the default value of 
> `spark.kubernetes.test.dockerFile` being 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`.
>  When using the tgz, the working directory is 
> `${spark.kubernetes.test.unpackSparkDir}`, and the relative path 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`
>  is invalid.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38302) Use Java 17 in K8S integration tests when setting spark-tgz

2022-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38302:
--
Summary: Use Java 17 in K8S integration tests when setting spark-tgz  (was: 
Dockerfile.java17 can't be used in K8s integration tests when )

> Use Java 17 in K8S integration tests when setting spark-tgz
> ---
>
> Key: SPARK-38302
> URL: https://issues.apache.org/jira/browse/SPARK-38302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: qian
>Priority: Minor
>
> When setting the parameter `spark-tgz` during integration tests, an error occurs 
> saying that 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`
> cannot be found. This is due to the default value of 
> `spark.kubernetes.test.dockerFile` being 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`.
>  When using the tgz, the working directory is 
> `${spark.kubernetes.test.unpackSparkDir}`, and the relative path 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`
>  is invalid.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38302) Dockerfile.java17 can't be used in K8s integration tests when

2022-02-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497676#comment-17497676
 ] 

Dongjoon Hyun commented on SPARK-38302:
---

I converted this to a subtask of SPARK-33772 to give more visibility to your 
issue, [~dcoliversun].

> Dockerfile.java17 can't be used in K8s integration tests when 
> --
>
> Key: SPARK-38302
> URL: https://issues.apache.org/jira/browse/SPARK-38302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: qian
>Priority: Minor
>
> When setting the parameter `spark-tgz` during integration tests, an error occurs 
> saying that 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`
> cannot be found. This is due to the default value of 
> `spark.kubernetes.test.dockerFile` being 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`.
>  When using the tgz, the working directory is 
> `${spark.kubernetes.test.unpackSparkDir}`, and the relative path 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`
>  is invalid.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38302) Dockerfile.java17 can't be used in K8s integration tests when

2022-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38302:
--
Parent: SPARK-33772
Issue Type: Sub-task  (was: Improvement)

> Dockerfile.java17 can't be used in K8s integration tests when 
> --
>
> Key: SPARK-38302
> URL: https://issues.apache.org/jira/browse/SPARK-38302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: qian
>Priority: Minor
>
> When setting the parameter `spark-tgz` during integration tests, an error occurs 
> saying that 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`
> cannot be found. This is due to the default value of 
> `spark.kubernetes.test.dockerFile` being 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`.
>  When using the tgz, the working directory is 
> `${spark.kubernetes.test.unpackSparkDir}`, and the relative path 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`
>  is invalid.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow

2022-02-24 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497662#comment-17497662
 ] 

Bruce Robbins edited comment on SPARK-38285 at 2/24/22, 7:19 PM:
-

I see your point.

It appears to be caused by [this 
commit|https://github.com/apache/spark/commit/c59988aa79] (for SPARK-34638). cc 
[~viirya]

Before that commit, this works:
{noformat}
create or replace temp view v1 as
select * from values
(array(
  named_struct('s', 'string1', 'b', array(named_struct('e', 'string2'), 
named_struct('e', 'string3'))),
  named_struct('s', 'string4', 'b', array(named_struct('e', 'string5'), 
named_struct('e', 'string6')))
  )
)
v1(o);

select eo.b.e from (select explode(o) as eo from v1);
{noformat}
It produces:
{noformat}
["string2","string3"]
["string5","string6"]
{noformat}
After that commit, you instead get the following error:
{noformat}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
org.apache.spark.sql.catalyst.InternalRow
{noformat}
You can bypass the error by caching the {{explode}}. For example, this 
works even after SPARK-34638:
{noformat}
create or replace temp view v1 as
select * from values
(array(
  named_struct('s', 'string1', 'b', array(named_struct('e', 'string2'), 
named_struct('e', 'string3'))),
  named_struct('s', 'string4', 'b', array(named_struct('e', 'string5'), 
named_struct('e', 'string6')))
  )
)
v1(o);

create or replace temporary view v2 as select explode(o) as eo from v1;
cache table v2;
select eo.b.e from v2;
{noformat}
Also you can bypass the error by turning off 
{{spark.sql.optimizer.expression.nestedPruning.enabled}} and 
{{spark.sql.optimizer.nestedSchemaPruning.enabled}}, as [~allebacco] 
mentioned above.


was (Author: bersprockets):
I see your point.

It appears to be caused by [this 
commit|https://github.com/apache/spark/commit/c59988aa79] (for SPARK-34638). cc 
[~viirya]

Before that commit, this works:
{noformat}
create or replace temp view v1 as
select * from values
(array(
  named_struct('s', 'string1', 'b', array(named_struct('e', 'string2'), 
named_struct('e', 'string3'))),
  named_struct('s', 'string4', 'b', array(named_struct('e', 'string5'), 
named_struct('e', 'string6')))
  )
)
v1(o);

select eo.b.e from (select explode(o) as eo from v1);
{noformat}
It produces:
{noformat}
["string2","string3"]
["string5","string6"]
{noformat}
On or after that commit, you instead get the following error:
{noformat}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
org.apache.spark.sql.catalyst.InternalRow
{noformat}
You can bypass the error by caching the {{explode}}. For example, this works 
even after SPARK-34638:
{noformat}
create or replace temp view v1 as
select * from values
(array(
  named_struct('s', 'string1', 'b', array(named_struct('e', 'string2'), 
named_struct('e', 'string3'))),
  named_struct('s', 'string4', 'b', array(named_struct('e', 'string5'), 
named_struct('e', 'string6')))
  )
)
v1(o);

create or replace temporary view v2 as select explode(o) as eo from v1;
cache table v2;
select eo.b.e from v2;
{noformat}
Also you can bypass the error by turning off 
{{spark.sql.optimizer.expression.nestedPruning.enabled}} and 
{{spark.sql.optimizer.nestedSchemaPruning.enabled}}, as [~allebacco] mentioned 
above.



> ClassCastException: GenericArrayData cannot be cast to InternalRow
> --
>
> Key: SPARK-38285
> URL: https://issues.apache.org/jira/browse/SPARK-38285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Alessandro Bacchini
>Priority: Major
>
> The following code with Spark 3.2.1 raises an exception:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([
>     StructField('o', 
>         ArrayType(
>             StructType([
>                 StructField('s', StringType(), False),
>                 StructField('b', ArrayType(
>                     StructType([
>                         StructField('e', StringType(), False)
>                     ]),
>                     True),
>                 False)
>             ]), 
>         True),
>     False)])
> value = {
>     "o": [
>         {
>             "s": "string1",
>             "b": [
>                 {
>                     "e": "string2"
>                 },
>                 {
>                     "e": "string3"
>                 }
>             ]
>         },
>         {
>             "s": "string4",
>             "b": [
>                 {
>                     "e": "string5"
>                 },
>                 {
>                     "e": "string6"
>                 },
>        

[jira] [Commented] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497673#comment-17497673
 ] 

Apache Spark commented on SPARK-38321:
--

User 'anchovYu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35654

> Fix BooleanSimplificationSuite under ANSI mode
> --
>
> Key: SPARK-38321
> URL: https://issues.apache.org/jira/browse/SPARK-38321
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38321:


Assignee: (was: Apache Spark)

> Fix BooleanSimplificationSuite under ANSI mode
> --
>
> Key: SPARK-38321
> URL: https://issues.apache.org/jira/browse/SPARK-38321
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38321:


Assignee: Apache Spark

> Fix BooleanSimplificationSuite under ANSI mode
> --
>
> Key: SPARK-38321
> URL: https://issues.apache.org/jira/browse/SPARK-38321
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497672#comment-17497672
 ] 

Apache Spark commented on SPARK-38321:
--

User 'anchovYu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35654

> Fix BooleanSimplificationSuite under ANSI mode
> --
>
> Key: SPARK-38321
> URL: https://issues.apache.org/jira/browse/SPARK-38321
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode

2022-02-24 Thread Xinyi Yu (Jira)
Xinyi Yu created SPARK-38321:


 Summary: Fix BooleanSimplificationSuite under ANSI mode
 Key: SPARK-38321
 URL: https://issues.apache.org/jira/browse/SPARK-38321
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Xinyi Yu






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow

2022-02-24 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497662#comment-17497662
 ] 

Bruce Robbins commented on SPARK-38285:
---

I see your point.

It appears to be caused by [this 
commit|https://github.com/apache/spark/commit/c59988aa79] (for SPARK-34638). cc 
[~viirya]

Before that commit, this works:
{noformat}
create or replace temp view v1 as
select * from values
(array(
  named_struct('s', 'string1', 'b', array(named_struct('e', 'string2'), 
named_struct('e', 'string3'))),
  named_struct('s', 'string4', 'b', array(named_struct('e', 'string5'), 
named_struct('e', 'string6')))
  )
)
v1(o);

select eo.b.e from (select explode(o) as eo from v1);
{noformat}
It produces:
{noformat}
["string2","string3"]
["string5","string6"]
{noformat}
On or after that commit, you instead get the following error:
{noformat}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
org.apache.spark.sql.catalyst.InternalRow
{noformat}
You can bypass the error by caching the {{explode}}. For example, this works 
even after SPARK-34638:
{noformat}
create or replace temp view v1 as
select * from values
(array(
  named_struct('s', 'string1', 'b', array(named_struct('e', 'string2'), 
named_struct('e', 'string3'))),
  named_struct('s', 'string4', 'b', array(named_struct('e', 'string5'), 
named_struct('e', 'string6')))
  )
)
v1(o);

create or replace temporary view v2 as select explode(o) as eo from v1;
cache table v2;
select eo.b.e from v2;
{noformat}
Also you can bypass the error by turning off 
{{spark.sql.optimizer.expression.nestedPruning.enabled}} and 
{{spark.sql.optimizer.nestedSchemaPruning.enabled}}, as [~allebacco] mentioned 
above.
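A minimal sketch of that workaround, using the {{v1}} view defined above (Scala here; the PySpark {{spark.conf.set}} calls from the report behave the same way):
{code:scala}
// Sketch only: disable the two nested-pruning optimizations named above,
// then re-run the query from the reproduction.
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", "false")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")
spark.sql("select eo.b.e from (select explode(o) as eo from v1)").show()
{code}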



> ClassCastException: GenericArrayData cannot be cast to InternalRow
> --
>
> Key: SPARK-38285
> URL: https://issues.apache.org/jira/browse/SPARK-38285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Alessandro Bacchini
>Priority: Major
>
> The following code with Spark 3.2.1 raises an exception:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([
>     StructField('o', 
>         ArrayType(
>             StructType([
>                 StructField('s', StringType(), False),
>                 StructField('b', ArrayType(
>                     StructType([
>                         StructField('e', StringType(), False)
>                     ]),
>                     True),
>                 False)
>             ]), 
>         True),
>     False)])
> value = {
>     "o": [
>         {
>             "s": "string1",
>             "b": [
>                 {
>                     "e": "string2"
>                 },
>                 {
>                     "e": "string3"
>                 }
>             ]
>         },
>         {
>             "s": "string4",
>             "b": [
>                 {
>                     "e": "string5"
>                 },
>                 {
>                     "e": "string6"
>                 },
>                 {
>                     "e": "string7"
>                 }
>             ]
>         }
>     ]
> }
> df = (
>     spark.createDataFrame([value], schema=t)
>     .select(F.explode("o").alias("eo"))
>     .select("eo.b.e")
> )
> df.show()
> {code}
> The exception message is:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
> org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
>   at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
>   at 

[jira] [Commented] (SPARK-38318) regression when replacing a dataset view

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497602#comment-17497602
 ] 

Apache Spark commented on SPARK-38318:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/35653

> regression when replacing a dataset view
> 
>
> Key: SPARK-38318
> URL: https://issues.apache.org/jira/browse/SPARK-38318
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Linhong Liu
>Priority: Major
>
> The below use case works well in 3.1 but fails in 3.2 and master.
> {code:java}
> sql("select 1").createOrReplaceTempView("v")
> sql("select * from v").createOrReplaceTempView("v")
> // in 3.1 it works well, and select will output 1
> // in 3.2 it failed with error: "AnalysisException: Recursive view v detected 
> (cycle: v -> v)"{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38320) (flat)MapGroupsWithState can timeout groups which just received inputs in the same microbatch

2022-02-24 Thread Alex Balikov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Balikov updated SPARK-38320:
-
Description: 
We have identified an issue where the RocksDB state store iterator will not 
pick up store updates made after its creation. As a result of this, the 
_timeoutProcessorIter_ in

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala]

will not pick up state changes made during _newDataProcessorIter_ input 
processing. The user-observed behavior is that a group state may receive input 
records and also be called with a timeout in the same micro batch. This 
contradicts the public documentation for GroupState -

[https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/GroupState.html]
 * The timeout is reset every time the function is called on a group, that is, 
when the group has new data, or the group has timed out. So the user has to set 
the timeout duration every time the function is called, otherwise, there will 
not be any timeout set.

  was:
We have identified an issue where the RocksDB state store iterator will not 
pick up store updates made after its creation. As a result of this, the 
_timeoutProcessorIter_ in 

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala]

will not pick up state changes made during newDataProcessorIter input 
processing. The user-observed behavior is that a group state may receive input 
records and also be called with a timeout in the same micro batch. This 
contradicts the public documentation for GroupState -

https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/GroupState.html

 
 * The timeout is reset every time the function is called on a group, that is, 
when the group has new data, or the group has timed out. So the user has to set 
the timeout duration every time the function is called, otherwise, there will 
not be any timeout set.


> (flat)MapGroupsWithState can timeout groups which just received inputs in the 
> same microbatch
> -
>
> Key: SPARK-38320
> URL: https://issues.apache.org/jira/browse/SPARK-38320
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Alex Balikov
>Priority: Major
>
> We have identified an issue where the RocksDB state store iterator will not 
> pick up store updates made after its creation. As a result of this, the 
> _timeoutProcessorIter_ in
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala]
> will not pick up state changes made during _newDataProcessorIter_ input 
> processing. The user-observed behavior is that a group state may receive 
> input records and also be called with a timeout in the same micro batch. This 
> contradicts the public documentation for GroupState -
> [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/GroupState.html]
>  * The timeout is reset every time the function is called on a group, that 
> is, when the group has new data, or the group has timed out. So the user has 
> to set the timeout duration every time the function is called, otherwise, 
> there will not be any timeout set.
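For reference, a minimal sketch of the documented contract quoted above, where the user function re-arms the timeout on every invocation; the {{Int}} count state and the 10-second duration are illustrative assumptions:
{code:scala}
// Sketch only: a mapGroupsWithState update function following the GroupState contract.
// Per this report, the hasTimedOut branch should not fire for a group that also
// received input in the same microbatch.
import org.apache.spark.sql.streaming.GroupState

def updateCount(key: String, values: Iterator[Int], state: GroupState[Int]): Int = {
  if (state.hasTimedOut) {
    state.remove()
    -1                                      // sentinel for an expired group
  } else {
    val count = state.getOption.getOrElse(0) + values.size
    state.update(count)
    state.setTimeoutDuration("10 seconds")  // must be re-set on every call
    count
  }
}
{code}
It would be wired up with something like {{ds.groupByKey(...).mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateCount)}}.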



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38320) (flat)MapGroupsWithState can timeout groups which just received inputs in the same microbatch

2022-02-24 Thread Alex Balikov (Jira)
Alex Balikov created SPARK-38320:


 Summary: (flat)MapGroupsWithState can timeout groups which just 
received inputs in the same microbatch
 Key: SPARK-38320
 URL: https://issues.apache.org/jira/browse/SPARK-38320
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.2.1
Reporter: Alex Balikov


We have identified an issue where the RocksDB state store iterator will not 
pick up store updates made after its creation. As a result of this, the 
_timeoutProcessorIter_ in 

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala]

will not pick up state changes made during newDataProcessorIter input 
processing. The user-observed behavior is that a group state may receive input 
records and also be called with a timeout in the same micro batch. This 
contradicts the public documentation for GroupState -

https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/GroupState.html

 
 * The timeout is reset every time the function is called on a group, that is, 
when the group has new data, or the group has timed out. So the user has to set 
the timeout duration every time the function is called, otherwise, there will 
not be any timeout set.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38318) regression when replacing a dataset view

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38318:


Assignee: (was: Apache Spark)

> regression when replacing a dataset view
> 
>
> Key: SPARK-38318
> URL: https://issues.apache.org/jira/browse/SPARK-38318
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Linhong Liu
>Priority: Major
>
> The below use case works well in 3.1 but fails in 3.2 and master.
> {code:java}
> sql("select 1").createOrReplaceTempView("v")
> sql("select * from v").createOrReplaceTempView("v")
> // in 3.1 it works well, and select will output 1
> // in 3.2 it failed with error: "AnalysisException: Recursive view v detected 
> (cycle: v -> v)"{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38318) regression when replacing a dataset view

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38318:


Assignee: Apache Spark

> regression when replacing a dataset view
> 
>
> Key: SPARK-38318
> URL: https://issues.apache.org/jira/browse/SPARK-38318
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Linhong Liu
>Assignee: Apache Spark
>Priority: Major
>
> The below use case works well in 3.1 but fails in 3.2 and master.
> {code:java}
> sql("select 1").createOrReplaceTempView("v")
> sql("select * from v").createOrReplaceTempView("v")
> // in 3.1 it works well, and select will output 1
> // in 3.2 it failed with error: "AnalysisException: Recursive view v detected 
> (cycle: v -> v)"{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38318) regression when replacing a dataset view

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497601#comment-17497601
 ] 

Apache Spark commented on SPARK-38318:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/35653

> regression when replacing a dataset view
> 
>
> Key: SPARK-38318
> URL: https://issues.apache.org/jira/browse/SPARK-38318
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Linhong Liu
>Priority: Major
>
> The below use case works well in 3.1 but fails in 3.2 and master.
> {code:java}
> sql("select 1").createOrReplaceTempView("v")
> sql("select * from v").createOrReplaceTempView("v")
> // in 3.1 it works well, and select will output 1
> // in 3.2 it failed with error: "AnalysisException: Recursive view v detected 
> (cycle: v -> v)"{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37318) Make FallbackStorageSuite robust in terms of DNS

2022-02-24 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497598#comment-17497598
 ] 

Erik Krogen commented on SPARK-37318:
-

Great point [~dongjoon], thanks for pointing it out!

> Make FallbackStorageSuite robust in terms of DNS
> 
>
> Key: SPARK-37318
> URL: https://issues.apache.org/jira/browse/SPARK-37318
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.1, 3.3.0
>
>
> Usually, the test case expects the hostname doesn't exist.
> {code}
> $ ping remote
> ping: cannot resolve remote: Unknown host
> {code}
> In some DNS environments, it always returns a response.
> {code}
> $ ping remote
> PING remote (23.217.138.110): 56 data bytes
> 64 bytes from 23.217.138.110: icmp_seq=0 ttl=57 time=8.660 ms
> {code}
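For reference, a sketch of the kind of environment guard the test-side fix relied on (note that, per the comments further down, SPARK-38062 later superseded this approach by fixing the production code instead); the ScalaTest {{assume}} usage is an assumption:
{code:scala}
// Sketch only: skip the test when the local DNS resolves the placeholder hostname,
// instead of assuming resolution always fails.
import java.net.InetAddress
import scala.util.Try

val remoteResolves = Try(InetAddress.getByName("remote")).isSuccess
// In the suite: assume(!remoteResolves, "DNS resolves 'remote', skipping this case")
{code}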



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38319) Implement Strict Mode to prevent QUERY the entire table

2022-02-24 Thread dimtiris kanoute (Jira)
dimtiris kanoute created SPARK-38319:


 Summary: Implement Strict Mode to prevent QUERY the entire table  
 Key: SPARK-38319
 URL: https://issues.apache.org/jira/browse/SPARK-38319
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.2.1
Reporter: dimtiris kanoute


We are using Spark Thrift Server as a service to run Spark SQL queries along 
with Hive metastore as the metadata service.

We would like to restrict users from querying the entire table: force them to 
use a {{WHERE}} clause on the partition column (i.e. {{SELECT * FROM TABLE WHERE 
partition_column=}}) *and* to {{LIMIT}} the output of the query when 
{{ORDER BY}} is used.

This behaviour is similar to what Hive exposes via the configurations

{{hive.strict.checks.no.partition.filter}}

{{hive.strict.checks.orderby.no.limit}}

which are described here:
[https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1812]

and

[https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1816]

 

This is a pretty common use case / feature found in other tools as well, for 
example in BigQuery: 
[https://cloud.google.com/bigquery/docs/querying-partitioned-tables#require_a_partition_filter_in_queries]

It would be nice to have this feature implemented in Spark when Hive support is 
enabled in a Spark session.
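
A hedged sketch of what such checks could look like from the user side; the 
configuration keys below are hypothetical (Spark does not provide them today) and 
simply mirror the Hive strict checks, and the partitioned table {{sales}} is made 
up for illustration:

{code:java}
// Hypothetical strict-mode switches, modelled on Hive's strict checks.
spark.conf.set("spark.sql.strict.checks.noPartitionFilter", "true")  // hypothetical key
spark.conf.set("spark.sql.strict.checks.orderByNoLimit", "true")     // hypothetical key

// Would be rejected under strict mode: no filter on the partition column.
spark.sql("SELECT * FROM sales")

// Would be accepted: partition filter present, and ORDER BY bounded by LIMIT.
spark.sql("SELECT * FROM sales WHERE dt = '2022-02-24' ORDER BY amount LIMIT 100")
{code}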



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37318) Make FallbackStorageSuite robust in terms of DNS

2022-02-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497578#comment-17497578
 ] 

Dongjoon Hyun commented on SPARK-37318:
---

For the record,
- Apache Spark 3.2.1 ~ 3.2.x has this test case fix.
- For Apache Spark 3.3, the SPARK-38062 improvement patch removes the restriction 
and logically reverts this test code change of SPARK-37318.

> Make FallbackStorageSuite robust in terms of DNS
> 
>
> Key: SPARK-37318
> URL: https://issues.apache.org/jira/browse/SPARK-37318
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.1, 3.3.0
>
>
> Usually, the test case expects the hostname doesn't exist.
> {code}
> $ ping remote
> ping: cannot resolve remote: Unknown host
> {code}
> In some DNS environments, it always returns a response.
> {code}
> $ ping remote
> PING remote (23.217.138.110): 56 data bytes
> 64 bytes from 23.217.138.110: icmp_seq=0 ttl=57 time=8.660 ms
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37318) Make FallbackStorageSuite robust in terms of DNS

2022-02-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497576#comment-17497576
 ] 

Dongjoon Hyun commented on SPARK-37318:
---

It's wrong in branch-3.2, [~xkrogen]. Please be careful about the Affected 
Versions.
bq. Note that the changes in this PR were reverted in SPARK-38062, in favor of 
a solution which fixes the production code rather than disabling the test case 
in certain environments.

> Make FallbackStorageSuite robust in terms of DNS
> 
>
> Key: SPARK-37318
> URL: https://issues.apache.org/jira/browse/SPARK-37318
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.1, 3.3.0
>
>
> Usually, the test case expects the hostname doesn't exist.
> {code}
> $ ping remote
> ping: cannot resolve remote: Unknown host
> {code}
> In some DNS environments, it always returns a response.
> {code}
> $ ping remote
> PING remote (23.217.138.110): 56 data bytes
> 64 bytes from 23.217.138.110: icmp_seq=0 ttl=57 time=8.660 ms
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38318) regression when replacing a dataset view

2022-02-24 Thread Linhong Liu (Jira)
Linhong Liu created SPARK-38318:
---

 Summary: regression when replacing a dataset view
 Key: SPARK-38318
 URL: https://issues.apache.org/jira/browse/SPARK-38318
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.1, 3.2.0, 3.3.0
Reporter: Linhong Liu


The below use case works well in 3.1 but fails in 3.2 and master.
{code:java}
sql("select 1").createOrReplaceTempView("v")
sql("select * from v").createOrReplaceTempView("v")
// in 3.1 it works well, and select will output 1
// in 3.2 it failed with error: "AnalysisException: Recursive view v detected 
(cycle: v -> v)"{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38273) Native memory leak in SparkPlan

2022-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38273.
---
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 35613
[https://github.com/apache/spark/pull/35613]

> Native memory leak in SparkPlan
> ---
>
> Key: SPARK-38273
> URL: https://issues.apache.org/jira/browse/SPARK-38273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Kevin Sewell
>Assignee: Kevin Sewell
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> SPARK-34647 replaced the ZstdInputStream with ZstdInputStreamNoFinalizer. 
> This meant that all usages of `CompressionCodec.compressedInputStream` would 
> need to close the stream manually, as this would no longer be handled by the 
> GC finaliser mechanism.
> In SparkPlan, the result of `CompressionCodec.compressedInputStream` is 
> wrapped in an Iterator which never calls close. This implementation needs to 
> make use of NextIterator, which allows the underlying streams to be closed.
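
For illustration, a minimal stand-alone sketch of the pattern the description asks 
for, i.e. an iterator that closes its underlying (decompression) stream once it is 
exhausted. Spark's own NextIterator utility provides this, but the sketch below 
avoids Spark internals so it stays self-contained:

{code:java}
import java.io.{ByteArrayInputStream, DataInputStream, EOFException, InputStream}

// Reads Int values from the stream and closes the stream when it is exhausted,
// instead of relying on a GC finalizer to release the underlying resources.
def intsFrom(in: InputStream): Iterator[Int] = new Iterator[Int] {
  private val din = new DataInputStream(in)
  private var buffered: Option[Int] = fetch()

  private def fetch(): Option[Int] =
    try Some(din.readInt())
    catch { case _: EOFException => din.close(); None }  // close eagerly at EOF

  override def hasNext: Boolean = buffered.isDefined
  override def next(): Int = {
    val v = buffered.get
    buffered = fetch()
    v
  }
}

// Tiny usage example: the stream is closed as soon as the last value is read.
val bytes = java.nio.ByteBuffer.allocate(8).putInt(1).putInt(2).array()
intsFrom(new ByteArrayInputStream(bytes)).foreach(println)
{code}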



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38273) decodeUnsafeRows's iterators should close underl… …ying input streams

2022-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38273:
--
Summary: decodeUnsafeRows's iterators should close underl… …ying input 
streams  (was: Native memory leak in SparkPlan)

> decodeUnsafeRows's iterators should close underl… …ying input streams
> -
>
> Key: SPARK-38273
> URL: https://issues.apache.org/jira/browse/SPARK-38273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Kevin Sewell
>Assignee: Kevin Sewell
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> SPARK-34647 replaced the ZstdInputStream with ZstdInputStreamNoFinalizer. 
> This meant that all usages of `CompressionCodec.compressedInputStream` would 
> need to close the stream manually, as this would no longer be handled by the 
> GC finaliser mechanism.
> In SparkPlan, the result of `CompressionCodec.compressedInputStream` is 
> wrapped in an Iterator which never calls close. This implementation needs to 
> make use of NextIterator, which allows the underlying streams to be closed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38273) decodeUnsafeRows's iterators should close underlying input streams

2022-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38273:
--
Summary: decodeUnsafeRows's iterators should close underlying input streams 
 (was: decodeUnsafeRows's iterators should close underl… …ying input streams)

> decodeUnsafeRows's iterators should close underlying input streams
> --
>
> Key: SPARK-38273
> URL: https://issues.apache.org/jira/browse/SPARK-38273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Kevin Sewell
>Assignee: Kevin Sewell
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> SPARK-34647 replaced the ZstdInputStream with ZstdInputStreamNoFinalizer. 
> This meant that all usages of `CompressionCodec.compressedInputStream` would 
> need to close the stream manually, as this would no longer be handled by the 
> GC finaliser mechanism.
> In SparkPlan, the result of `CompressionCodec.compressedInputStream` is 
> wrapped in an Iterator which never calls close. This implementation needs to 
> make use of NextIterator, which allows the underlying streams to be closed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38273) Native memory leak in SparkPlan

2022-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38273:
-

Assignee: Kevin Sewell

> Native memory leak in SparkPlan
> ---
>
> Key: SPARK-38273
> URL: https://issues.apache.org/jira/browse/SPARK-38273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Kevin Sewell
>Assignee: Kevin Sewell
>Priority: Major
>
> SPARK-34647 replaced the ZstdInputStream with ZstdInputStreamNoFinalizer. 
> This meant that all usages of `CompressionCodec.compressedInputStream` would 
> need to close the stream manually, as this would no longer be handled by the 
> GC finaliser mechanism.
> In SparkPlan, the result of `CompressionCodec.compressedInputStream` is 
> wrapped in an Iterator which never calls close. This implementation needs to 
> make use of NextIterator, which allows the underlying streams to be closed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38300) Refactor `fileToString` and `resourceToBytes` in catalyst.util to clean up duplicate codes

2022-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38300.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35622
[https://github.com/apache/spark/pull/35622]

> Refactor `fileToString` and `resourceToBytes` in catalyst.util to clean up 
> duplicate codes
> --
>
> Key: SPARK-38300
> URL: https://issues.apache.org/jira/browse/SPARK-38300
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38300) Use ByteStreams.toByteArray to simplify fileToString and resourceToBytes in catalyst.util

2022-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38300:
--
Summary: Use ByteStreams.toByteArray to simplify fileToString and 
resourceToBytes in catalyst.util  (was: Refactor `fileToString` and 
`resourceToBytes` in catalyst.util to clean up duplicate codes)

> Use ByteStreams.toByteArray to simplify fileToString and resourceToBytes in 
> catalyst.util
> 
>
> Key: SPARK-38300
> URL: https://issues.apache.org/jira/browse/SPARK-38300
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
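
For context, a rough sketch of the simplification the new title describes; the 
exact helper signatures in catalyst.util are assumptions here, not copied from 
the PR:

{code:java}
import java.io.{File, FileInputStream}
import java.nio.charset.StandardCharsets
import com.google.common.io.ByteStreams  // Guava, already on Spark's classpath

// Read a whole file as a UTF-8 string via ByteStreams.toByteArray.
def fileToString(file: File): String = {
  val in = new FileInputStream(file)
  try new String(ByteStreams.toByteArray(in), StandardCharsets.UTF_8)
  finally in.close()
}

// Read a classpath resource fully into a byte array the same way.
def resourceToBytes(resource: String): Array[Byte] = {
  val in = Thread.currentThread().getContextClassLoader.getResourceAsStream(resource)
  try ByteStreams.toByteArray(in)
  finally in.close()
}
{code}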




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38300) Refactor `fileToString` and `resourceToBytes` in catalyst.util to clean up duplicate codes

2022-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38300:
-

Assignee: Yang Jie

> Refactor `fileToString` and `resourceToBytes` in catalyst.util to clean up 
> duplicate codes
> --
>
> Key: SPARK-38300
> URL: https://issues.apache.org/jira/browse/SPARK-38300
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38317) Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"

2022-02-24 Thread Jolan Rensen (Jira)
Jolan Rensen created SPARK-38317:


 Summary: Encoding of java.time.Period always results in "INTERVAL 
'0-0' YEAR TO MONTH"
 Key: SPARK-38317
 URL: https://issues.apache.org/jira/browse/SPARK-38317
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.1, 3.2.0
Reporter: Jolan Rensen


{code:java}
val dates = Seq(
    Period.ZERO,
    Period.ofWeeks(2),
).toDS()
dates.show(false)
{code}

Results in:
{code}
+----------------------------+
|value                       |
+----------------------------+
|INTERVAL '0-0' YEAR TO MONTH|
|INTERVAL '0-0' YEAR TO MONTH|
+----------------------------+
{code}
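
One likely explanation (an assumption, not confirmed in this ticket): Spark encodes 
java.time.Period as a year-month interval, which has no day or week component, so 
Period.ofWeeks(2) collapses to zero months. Under that assumption, periods expressed 
in years/months should survive the round trip; a quick check, assuming a spark-shell 
session where spark.implicits._ is in scope:

{code:java}
import java.time.Period

// Only the year and month components would be carried into the interval type.
val months = Seq(Period.ofMonths(2), Period.ofYears(1)).toDS()
months.show(false)
// Expected under the assumption above:
// INTERVAL '0-2' YEAR TO MONTH
// INTERVAL '1-0' YEAR TO MONTH
{code}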



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38316) Fix SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite under ANSI mode

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497418#comment-17497418
 ] 

Apache Spark commented on SPARK-38316:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35652

> Fix 
> SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite
>  under ANSI mode
> ---
>
> Key: SPARK-38316
> URL: https://issues.apache.org/jira/browse/SPARK-38316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38316) Fix SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite under ANSI mode

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38316:


Assignee: Gengliang Wang  (was: Apache Spark)

> Fix 
> SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite
>  under ANSI mode
> ---
>
> Key: SPARK-38316
> URL: https://issues.apache.org/jira/browse/SPARK-38316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38316) Fix SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite under ANSI mode

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38316:


Assignee: Apache Spark  (was: Gengliang Wang)

> Fix 
> SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite
>  under ANSI mode
> ---
>
> Key: SPARK-38316
> URL: https://issues.apache.org/jira/browse/SPARK-38316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38316) Fix SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite under ANSI mode

2022-02-24 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-38316:
--

 Summary: Fix 
SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite
 under ANSI mode
 Key: SPARK-38316
 URL: https://issues.apache.org/jira/browse/SPARK-38316
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36194) Remove the aggregation from left semi/anti join if the same aggregation has already been done on left side

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497410#comment-17497410
 ] 

Apache Spark commented on SPARK-36194:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/35651

> Remove the aggregation from left semi/anti join if the same aggregation has 
> already been done on left side
> --
>
> Key: SPARK-36194
> URL: https://issues.apache.org/jira/browse/SPARK-36194
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497407#comment-17497407
 ] 

Apache Spark commented on SPARK-38314:
--

User 'Yaohua628' has created a pull request for this issue:
https://github.com/apache/spark/pull/35650

> Fail to read parquet files after writing the hidden file metadata in
> 
>
> Key: SPARK-38314
> URL: https://issues.apache.org/jira/browse/SPARK-38314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Yaohua Zhao
>Priority: Major
>
> Selecting and then writing a df containing the hidden file metadata column 
> `_metadata` into a file format like `parquet` or `delta` will still keep the 
> internal `Attribute` metadata information. Then, when reading those `parquet` 
> or `delta` files again, it will break the code, because Spark wrongly thinks 
> the user data column named `_metadata` is a hidden file source metadata 
> column.
>  
> Reproducible code:
> {code:java}
> // prepare a file source df
> df.select("*", "_metadata")
>   .write.format("parquet").save(path)
> spark.read.format("parquet").load(path)
>   .select("*").show(){code}
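
For readers unfamiliar with the hidden column involved here, a short sketch of how 
`_metadata` is normally consumed on the read path, reusing {{path}} from the 
reproduction above; the field names follow the file-metadata feature's documentation 
and the exact set available may differ by version:

{code:java}
import org.apache.spark.sql.functions.col

// The hidden struct column is only materialized when explicitly selected.
spark.read.format("parquet").load(path)
  .select(
    col("_metadata.file_path"),
    col("_metadata.file_name"),
    col("_metadata.file_size"),
    col("_metadata.file_modification_time"))
  .show(false)
{code}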



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497406#comment-17497406
 ] 

Apache Spark commented on SPARK-38314:
--

User 'Yaohua628' has created a pull request for this issue:
https://github.com/apache/spark/pull/35650

> Fail to read parquet files after writing the hidden file metadata in
> 
>
> Key: SPARK-38314
> URL: https://issues.apache.org/jira/browse/SPARK-38314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Yaohua Zhao
>Priority: Major
>
> Selecting and then writing a df containing the hidden file metadata column 
> `_metadata` into a file format like `parquet` or `delta` will still keep the 
> internal `Attribute` metadata information. Then, when reading those `parquet` 
> or `delta` files again, it will break the code, because Spark wrongly thinks 
> the user data column named `_metadata` is a hidden file source metadata 
> column.
>  
> Reproducible code:
> {code:java}
> // prepare a file source df
> df.select("*", "_metadata")
>   .write.format("parquet").save(path)
> spark.read.format("parquet").load(path)
>   .select("*").show(){code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38314:


Assignee: Apache Spark

> Fail to read parquet files after writing the hidden file metadata in
> 
>
> Key: SPARK-38314
> URL: https://issues.apache.org/jira/browse/SPARK-38314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Yaohua Zhao
>Assignee: Apache Spark
>Priority: Major
>
> Selecting and then writing a df containing the hidden file metadata column 
> `_metadata` into a file format like `parquet` or `delta` will still keep the 
> internal `Attribute` metadata information. Then, when reading those `parquet` 
> or `delta` files again, it will break the code, because Spark wrongly thinks 
> the user data column named `_metadata` is a hidden file source metadata 
> column.
>  
> Reproducible code:
> {code:java}
> // prepare a file source df
> df.select("*", "_metadata")
>   .write.format("parquet").save(path)
> spark.read.format("parquet").load(path)
>   .select("*").show(){code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in

2022-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38314:


Assignee: (was: Apache Spark)

> Fail to read parquet files after writing the hidden file metadata in
> 
>
> Key: SPARK-38314
> URL: https://issues.apache.org/jira/browse/SPARK-38314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Yaohua Zhao
>Priority: Major
>
> Selecting and then writing a df containing the hidden file metadata column 
> `_metadata` into a file format like `parquet` or `delta` will still keep the 
> internal `Attribute` metadata information. Then, when reading those `parquet` 
> or `delta` files again, it will break the code, because Spark wrongly thinks 
> the user data column named `_metadata` is a hidden file source metadata 
> column.
>  
> Reproducible code:
> {code:java}
> // prepare a file source df
> df.select("*", "_metadata")
>   .write.format("parquet").save(path)
> spark.read.format("parquet").load(path)
>   .select("*").show(){code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38315) Add a config to collect objects as Java 8 types in the Thrift server

2022-02-24 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497362#comment-17497362
 ] 

Max Gekk commented on SPARK-38315:
--

I am working on this.

> Add a config to collect objects as Java 8 types in the Thrift server
> 
>
> Key: SPARK-38315
> URL: https://issues.apache.org/jira/browse/SPARK-38315
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Add a new config that controls collect() and allows enabling/disabling Java 8 
> types in the Thrift server. The config should solve the following issue:
> When a user connects to the Thrift Server and a query involves a datasource 
> connector which doesn't handle Java 8 types, the user observes the following 
> exception:
> {code:java}
> ERROR SparkExecuteStatementOperation: Error executing query with 
> ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING,  
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while 
> encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid 
> external type for schema of timestamp  
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, 
> TimestampType, instantToMicros, 
> validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, 
> false) AS loan_perf_date#1125  
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239)
>   
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210)
>   
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)  
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38315) Add a config to collect objects as Java 8 types in the Thrift server

2022-02-24 Thread Max Gekk (Jira)
Max Gekk created SPARK-38315:


 Summary: Add a config to collect objects as Java 8 types in the 
Thrift server
 Key: SPARK-38315
 URL: https://issues.apache.org/jira/browse/SPARK-38315
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk
Assignee: Max Gekk


Add a new config that controls collect() and allows enabling/disabling Java 8 
types in the Thrift server. The config should solve the following issue:

When a user connects to the Thrift Server and a query involves a datasource 
connector which doesn't handle Java 8 types, the user observes the following 
exception:

{code:java}
ERROR SparkExecuteStatementOperation: Error executing query with 
ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING,  
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while encoding: 
java.lang.RuntimeException: java.sql.Timestamp is not a valid external type for 
schema of timestamp  
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else 
staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, 
TimestampType, instantToMicros, 
validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, 
false) AS loan_perf_date#1125  
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239)
  
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210)
  
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)  
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)  
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  
{code}
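
For context, a hedged sketch of the behaviour such a config would toggle. The 
session conf used below ({{spark.sql.datetime.java8API.enabled}}) already exists; 
the Thrift-server-specific key this ticket proposes is left as a placeholder, 
since its name is not decided here:

{code:java}
// With the Java 8 API enabled, collect() returns java.time values.
spark.conf.set("spark.sql.datetime.java8API.enabled", "true")
spark.sql("SELECT timestamp'2022-02-24 00:00:00' AS ts").collect()
// -> Row values are java.time.Instant

// With it disabled, collect() returns the legacy java.sql values that the
// failing datasource connector expects.
spark.conf.set("spark.sql.datetime.java8API.enabled", "false")
spark.sql("SELECT timestamp'2022-02-24 00:00:00' AS ts").collect()
// -> Row values are java.sql.Timestamp

// Hypothetical: the new Thrift-server-side switch proposed by this ticket.
// spark.conf.set("spark.sql.thriftServer.<newConfigKey>", "false")
{code}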

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37932) Analyzer can fail when join left side and right side are the same view

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497329#comment-17497329
 ] 

Apache Spark commented on SPARK-37932:
--

User 'chenzhx' has created a pull request for this issue:
https://github.com/apache/spark/pull/35649

> Analyzer can fail when join left side and right side are the same view
> --
>
> Key: SPARK-37932
> URL: https://issues.apache.org/jira/browse/SPARK-37932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Feng Zhu
>Priority: Major
> Attachments: sql_and_exception
>
>
> See the attachment for details, including SQL and the exception information.
>  * sql1, there is a normal filter (LO_SUPPKEY > 10) in the right side 
> subquery, Analyzer works as expected;
>  * sql2, there is a HAVING filter(HAVING COUNT(DISTINCT LO_SUPPKEY) > 1) in 
> the right side subquery, Analyzer failed with "Resolved attribute(s) 
> LO_SUPPKEY#337 missing ...".
>       From the debug info, the problem seems to occur after the rule 
> DeduplicateRelations is applied.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37932) Analyzer can fail when join left side and right side are the same view

2022-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497324#comment-17497324
 ] 

Apache Spark commented on SPARK-37932:
--

User 'chenzhx' has created a pull request for this issue:
https://github.com/apache/spark/pull/35649

> Analyzer can fail when join left side and right side are the same view
> --
>
> Key: SPARK-37932
> URL: https://issues.apache.org/jira/browse/SPARK-37932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Feng Zhu
>Priority: Major
> Attachments: sql_and_exception
>
>
> See the attachment for details, including SQL and the exception information.
>  * sql1, there is a normal filter (LO_SUPPKEY > 10) in the right side 
> subquery, Analyzer works as expected;
>  * sql2, there is a HAVING filter(HAVING COUNT(DISTINCT LO_SUPPKEY) > 1) in 
> the right side subquery, Analyzer failed with "Resolved attribute(s) 
> LO_SUPPKEY#337 missing ...".
>       From the debug info, the problem seems to occur after the rule 
> DeduplicateRelations is applied.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in

2022-02-24 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-38314:

Description: 
Selecting and then writing a df containing the hidden file metadata column 
`_metadata` into a file format like `parquet` or `delta` will still keep the 
internal `Attribute` metadata information. Then, when reading those `parquet` or 
`delta` files again, it will break the code, because Spark wrongly thinks the 
user data column named `_metadata` is a hidden file source metadata column.

 

Reproducible code:
{code:java}
// prepare a file source df
df.select("*", "_metadata")
  .write.format("parquet").save(path)
spark.read.format("parquet").load(path)
  .select("*").show(){code}

  was:
Selecting and then writing df containing hidden file metadata column 
`_metadata` into a file format like `parquet`, `delta` will still keep the 
internal `Attribute` metadata information. Then when reading those `parquet`, 
`delta` files again, it will actually break the code, because it wrongly thinks 
user data schema named `_metadata` is a hidden file source metadata column.

 

Reproducible code:

```

// prepare a file source df

df.select("*", "_metadata")
  .write.format("parquet").save(path)

spark.read.format("parquet").load(path)
  .select("*").show()

```


> Fail to read parquet files after writing the hidden file metadata in
> 
>
> Key: SPARK-38314
> URL: https://issues.apache.org/jira/browse/SPARK-38314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Yaohua Zhao
>Priority: Major
>
> Selecting and then writing a df containing the hidden file metadata column 
> `_metadata` into a file format like `parquet` or `delta` will still keep the 
> internal `Attribute` metadata information. Then, when reading those `parquet` 
> or `delta` files again, it will break the code, because Spark wrongly thinks 
> the user data column named `_metadata` is a hidden file source metadata 
> column.
>  
> Reproducible code:
> {code:java}
> // prepare a file source df
> df.select("*", "_metadata")
>   .write.format("parquet").save(path)
> spark.read.format("parquet").load(path)
>   .select("*").show(){code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in

2022-02-24 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-38314:
---

 Summary: Fail to read parquet files after writing the hidden file 
metadata in
 Key: SPARK-38314
 URL: https://issues.apache.org/jira/browse/SPARK-38314
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: Yaohua Zhao


Selecting and then writing a df containing the hidden file metadata column 
`_metadata` into a file format like `parquet` or `delta` will still keep the 
internal `Attribute` metadata information. Then, when reading those `parquet` or 
`delta` files again, it will break the code, because Spark wrongly thinks the 
user data column named `_metadata` is a hidden file source metadata column.

 

Reproducible code:

```

// prepare a file source df

df.select("*", "_metadata")
  .write.format("parquet").save(path)

spark.read.format("parquet").load(path)
  .select("*").show()

```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow

2022-02-24 Thread Alessandro Bacchini (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Bacchini updated SPARK-38285:

Description: 
The following code with Spark 3.2.1 raises an exception:

{code:python}
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

t = StructType([
    StructField('o', 
        ArrayType(
            StructType([
                StructField('s', StringType(), False),
                StructField('b', ArrayType(
                    StructType([
                        StructField('e', StringType(), False)
                    ]),
                    True),
                False)
            ]), 
        True),
    False)])

value = {
    "o": [
        {
            "s": "string1",
            "b": [
                {
                    "e": "string2"
                },
                {
                    "e": "string3"
                }
            ]
        },
        {
            "s": "string4",
            "b": [
                {
                    "e": "string5"
                },
                {
                    "e": "string6"
                },
                {
                    "e": "string7"
                }
            ]
        }
    ]
}

df = (
    spark.createDataFrame([value], schema=t)
    .select(F.explode("o").alias("eo"))
    .select("eo.b.e")
)


df.show()
{code}

The exception message is:
{code}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
org.apache.spark.sql.catalyst.InternalRow
at 
org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at 
org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
at 
org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
at 
org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at 
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at 
org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at 
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
at 
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:93)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at 
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}

I am using Spark 3.2.1, but I don't know whether Spark 3.3.0 is also affected.

Please note that the issue seems to be related to SPARK-37577: I am using the 
same DataFrame schema, but this time I have populated it with a non-empty value.

I think this is a bug because with the following configuration it works as 
expected:
{code:python}
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)
{code}

Update: The provided code works with Spark 3.1.2 without problems, so it 
seems to be an error caused by expression pruning.

The expected result is:

{code}
+----------------------------+
|e                           |
+----------------------------+
|[string2, string3]          |
|[string5, string6, string7]|
+----------------------------+
{code}

  was:
The following code with Spark 3.2.1 raises an exception:

{code:python}
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

t = StructType([
    StructField('o', 
        ArrayType(
            StructType([
     

[jira] [Updated] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow

2022-02-24 Thread Alessandro Bacchini (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Bacchini updated SPARK-38285:

Description: 
The following code with Spark 3.2.1 raises an exception:

{code:python}
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

t = StructType([
    StructField('o', 
        ArrayType(
            StructType([
                StructField('s', StringType(), False),
                StructField('b', ArrayType(
                    StructType([
                        StructField('e', StringType(), False)
                    ]),
                    True),
                False)
            ]), 
        True),
    False)])

value = {
    "o": [
        {
            "s": "string1",
            "b": [
                {
                    "e": "string2"
                },
                {
                    "e": "string3"
                }
            ]
        },
        {
            "s": "string4",
            "b": [
                {
                    "e": "string5"
                },
                {
                    "e": "string6"
                },
                {
                    "e": "string7"
                }
            ]
        }
    ]
}

df = (
    spark.createDataFrame([value], schema=t)
    .select(F.explode("o").alias("eo"))
    .select("eo.b.e")
)


df.show()
{code}

The exception message is:
{code}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to 
org.apache.spark.sql.catalyst.InternalRow
at 
org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at 
org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
at 
org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
at 
org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at 
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at 
org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at 
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
at 
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:93)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at 
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}

I am using Spark 3.2.1, but I don't know whether Spark 3.3.0 is also affected.

Please note that the issue seems to be related to SPARK-37577: I am using the 
same DataFrame schema, but this time I have populated it with a non-empty value.

I think this is a bug because with the following configuration it works as 
expected:
{code:python}
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)
{code}

Update: The provided code works with Spark 3.1.2 without problems, so it 
seems to be an error caused by expression pruning.

  was:
The following code with Spark 3.2.1 raises an exception:

{code:python}
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

t = StructType([
    StructField('o', 
        ArrayType(
            StructType([
                StructField('s', StringType(), False),
                StructField('b', ArrayType(
                    StructType([
                        StructField('e', StringType(), False)
                    ]),
       
