[jira] [Resolved] (SPARK-45655) current_date() not supported in Streaming Query Observed metrics

2023-11-11 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-45655.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43517
[https://github.com/apache/spark/pull/43517]

> current_date() not supported in Streaming Query Observed metrics
> 
>
> Key: SPARK-45655
> URL: https://issues.apache.org/jira/browse/SPARK-45655
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bhuwan Sahni
>Assignee: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Streaming queries do not support current_date() inside CollectMetrics. The 
> primary reason is that current_date() (which resolves to 
> CurrentBatchTimestamp) is marked as non-deterministic. However, 
> {{current_date}} and {{current_timestamp}} are both deterministic today, and 
> {{current_batch_timestamp}} should be as well.
>  
> As an example, the query below fails due to the observe() call on the 
> DataFrame.
>  
> {quote}val inputData = MemoryStream[Timestamp]
> inputData.toDF()
>   .filter("value < current_date()")
>   .observe("metrics", count(expr("value >= current_date()")).alias("dropped"))
>   .writeStream
>   .queryName("ts_metrics_test")
>   .format("memory")
>   .outputMode("append")
>   .start()
> {quote}
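>  
> For reference, a self-contained version of the repro (a sketch assuming an 
> active test SparkSession named {{spark}}):
> {code:scala}
> import java.sql.Timestamp
> import org.apache.spark.sql.execution.streaming.MemoryStream
> import org.apache.spark.sql.functions.{count, expr}
> 
> implicit val sqlCtx = spark.sqlContext  // MemoryStream needs an implicit SQLContext
> import spark.implicits._                // supplies the Encoder[Timestamp]
> 
> val inputData = MemoryStream[Timestamp]
> val query = inputData.toDF()
>   .filter("value < current_date()")
>   .observe("metrics", count(expr("value >= current_date()")).alias("dropped"))
>   .writeStream
>   .queryName("ts_metrics_test")
>   .format("memory")
>   .outputMode("append")
>   .start()
> {code}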
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45655) current_date() not supported in Streaming Query Observed metrics

2023-11-11 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-45655:


Assignee: Bhuwan Sahni

> current_date() not supported in Streaming Query Observed metrics
> 
>
> Key: SPARK-45655
> URL: https://issues.apache.org/jira/browse/SPARK-45655
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bhuwan Sahni
>Assignee: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Streaming queries do not support current_date() inside CollectMetrics. The 
> primary reason is that current_date() (which resolves to 
> CurrentBatchTimestamp) is marked as non-deterministic. However, 
> {{current_date}} and {{current_timestamp}} are both deterministic today, and 
> {{current_batch_timestamp}} should be as well.
>  
> As an example, the query below fails due to the observe() call on the 
> DataFrame.
>  
> {quote}val inputData = MemoryStream[Timestamp]
> inputData.toDF()
>   .filter("value < current_date()")
>   .observe("metrics", count(expr("value >= current_date()")).alias("dropped"))
>   .writeStream
>   .queryName("ts_metrics_test")
>   .format("memory")
>   .outputMode("append")
>   .start()
> {quote}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45894) hive table level setting hadoop.mapred.max.split.size

2023-11-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-45894:

Target Version/s:   (was: 3.5.0)

> hive table level setting hadoop.mapred.max.split.size
> -
>
> Key: SPARK-45894
> URL: https://issues.apache.org/jira/browse/SPARK-45894
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: guihuawen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In a Hive table scan, configuring the hadoop.mapred.max.split.size parameter 
> increases the parallelism of the scan stage, thereby reducing the running 
> time.
> However, when a large table and a small table appear in the same query, a 
> single global hadoop.mapred.max.split.size setting makes some stages run a 
> very large number of tasks while others run very few. To keep this balanced, 
> the hadoop.mapred.max.split.size parameter should be settable separately for 
> each Hive table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45522) Migrate jetty 9 to jetty 12

2023-11-11 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17785257#comment-17785257
 ] 

Yang Jie commented on SPARK-45522:
--

[~HF] Based on your experience, is it possible to complete all the upgrade 
work before June 24th? Or should I re-scope this ticket to a Jetty 9-to-10 
upgrade first?
 
 
 
 
 

> Migrate jetty 9 to jetty 12
> ---
>
> Key: SPARK-45522
> URL: https://issues.apache.org/jira/browse/SPARK-45522
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>
> Jetty 12 supports Jakarta EE 8, EE 9, and EE 10 simultaneously. But the 
> version span is quite large and the documentation needs a detailed read; it 
> is not certain this can be completed within the 4.0 cycle, so the ticket is 
> set to low priority.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45871) Change `.toBuffer.toSeq` to `.toSeq`

2023-11-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-45871:


Assignee: Yang Jie

> Change `.toBuffer.toSeq` to `.toSeq`
> 
>
> Key: SPARK-45871
> URL: https://issues.apache.org/jira/browse/SPARK-45871
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>
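> A minimal sketch of the change (hypothetical snippet; the actual occurrences 
> are in the Connect module): {{.toBuffer.toSeq}} performs two conversions, 
> first into a mutable ArrayBuffer and then into a Seq, while a single 
> {{.toSeq}} gives the same result.
> {code:scala}
> // before: Iterator -> ArrayBuffer -> Seq (two conversions)
> val before: Seq[Int] = Iterator(1, 2, 3).toBuffer.toSeq
> 
> // after: Iterator -> Seq (one conversion)
> val after: Seq[Int] = Iterator(1, 2, 3).toSeq
> {code}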




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45871) Change `.toBuffer.toSeq` to `.toSeq`

2023-11-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-45871.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43745
[https://github.com/apache/spark/pull/43745]

> Change `.toBuffer.toSeq` to `.toSeq`
> 
>
> Key: SPARK-45871
> URL: https://issues.apache.org/jira/browse/SPARK-45871
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43100) Mismatch of field name in log event writer and parser for push shuffle metrics

2023-11-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43100:
---
Labels: pull-request-available  (was: )

> Mismatch of field name in log event writer and parser for push shuffle metrics
> --
>
> Key: SPARK-43100
> URL: https://issues.apache.org/jira/browse/SPARK-43100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ye Zhou
>Priority: Major
>  Labels: pull-request-available
>
> For push-based shuffle metrics, when writing the event out to the log file, 
> the field name is "Push Based Shuffle", per 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L548]
> But when parsing it back in the SHS, the expected field name is "Shuffle 
> Push Read Metrics", as shown in 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L1264]
> This mismatch makes all push shuffle metrics read as 0 from SHS REST calls.
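>  
> The failure mode is a plain write/read key mismatch. A toy sketch (not the 
> actual JsonProtocol code) using the two field names from above:
> {code:scala}
> // the writer emits the metrics under one key...
> val written = Map("Push Based Shuffle" -> 42)
> 
> // ...but the parser looks them up under a different key and falls back to 0
> val parsed = written.getOrElse("Shuffle Push Read Metrics", 0)
> 
> assert(parsed == 0)  // the written value is silently dropped
> {code}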



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44659) SPJ: Include keyGroupedPartitioning in StoragePartitionJoinParams equality check

2023-11-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44659:
---
Labels: pull-request-available  (was: )

> SPJ: Include keyGroupedPartitioning in StoragePartitionJoinParams equality 
> check
> 
>
> Key: SPARK-44659
> URL: https://issues.apache.org/jira/browse/SPARK-44659
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Chao Sun
>Priority: Minor
>  Labels: pull-request-available
>
> Currently {{StoragePartitionJoinParams}} doesn't include 
> {{keyGroupedPartitioning}} in its {{equals}} and {{hashCode}} computation. 
> For completeness, we should include it as well since it is a member of the 
> class.
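>  
> The general shape of the fix (a hypothetical sketch, not the actual 
> StoragePartitionJoinParams definition):
> {code:scala}
> class Params(val keyGroupedPartitioning: Option[Seq[String]], val other: Int) {
>   // include every member, keyGroupedPartitioning too, in equals/hashCode
>   override def equals(obj: Any): Boolean = obj match {
>     case p: Params =>
>       p.keyGroupedPartitioning == keyGroupedPartitioning && p.other == other
>     case _ => false
>   }
>   override def hashCode(): Int =
>     31 * keyGroupedPartitioning.hashCode() + other.hashCode()
> }
> {code}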



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44514) Optimize join if maximum number of rows on one side is 1

2023-11-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44514:
---
Labels: pull-request-available  (was: )

> Optimize join if maximum number of rows on one side is 1
> 
>
> Key: SPARK-44514
> URL: https://issues.apache.org/jira/browse/SPARK-44514
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
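> A sketch of the query shape this could target (hypothetical example): an 
> aggregate without grouping produces at most one row, so the join against it 
> can be simplified.
> {code:scala}
> spark.sql("create table t(a int) using parquet")
> spark.sql("create table s(b int) using parquet")
> 
> // the right side has maxRows = 1
> spark.sql(
>   """
>     |select * from t
>     |join (select max(b) as m from s) x on t.a = x.m
>     |""".stripMargin).explain(true)
> {code}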




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]

2023-11-11 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17785234#comment-17785234
 ] 

Bruce Robbins commented on SPARK-45896:
---

I think I have a handle on this and will make a PR shortly.

> Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
> --
>
> Key: SPARK-45896
> URL: https://issues.apache.org/jira/browse/SPARK-45896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following action fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed 
> to encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -1), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
> unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
> scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
> AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external 
> type for schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
>  Source)
> ...
> {noformat}
> However, it succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
> {noformat}
> Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed 
> to encode a value of the expressions: 
> externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
> ObjectType(class java.lang.Object), false, -1), 
> assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
> ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
> lambdavariable(ExternalMapToCatalyst_value, ObjectType(class 
> java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -3), 
> assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -3), IntegerType, IntegerType)), 
> unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
> validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
> ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
> ObjectType(class scala.Option))), None), input[0, 
> scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external 
> type for schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
>  Source)
> ...
> {noformat}
> As with the first example, this succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
> {noformat}
> Other cases that fail on 3.4.1, 3.5.0, and master but work fine on 3.3.3:
> - {{Seq[Option[Timestamp]]}}
> - {{Map[Option[Timestamp]]}}
> - {{Seq[Option[Date]]}}
> - {{Map[Option[Date]]}}
> - {{Seq[Option[BigDecimal]]}}
> - {{Map[Option[BigDecimal]]}}
> However, the following work fine on 3.3.3, 3.4.1, 3.5.0, and master:
> - {{Seq[Option[Map]]}}
> - {{Map[Option[Map]]}}
> - {{Seq[Option[]]}}
> - {{Map[Option[]]}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]

2023-11-11 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45896:
--
Description: 
The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: 
externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), 
assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), 
true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), 
assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
ObjectType(class scala.Option))), None), input[0, 
scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
As with the first example, this succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
{noformat}
Other cases that fail on 3.4.1, 3.5.0, and master but work fine on 3.3.3:
- {{Seq[Option[Timestamp]]}}
- {{Map[Option[Timestamp]]}}
- {{Seq[Option[Date]]}}
- {{Map[Option[Date]]}}
- {{Seq[Option[BigDecimal]]}}
- {{Map[Option[BigDecimal]]}}

However, the following work fine on 3.3.3, 3.4.1, 3.5.0, and master:

- {{Seq[Option[Map]]}}
- {{Map[Option[Map]]}}
- {{Seq[Option[]]}}
- {{Map[Option[]]}}

  was:
The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: 

[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]

2023-11-11 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45896:
--
Summary: Expression encoding fails for Seq/Map of 
Option[Seq/Date/Timestamp/BigDecimal]  (was: Expression encoding fails for 
Seq/Map of Option[Seq])

> Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
> --
>
> Key: SPARK-45896
> URL: https://issues.apache.org/jira/browse/SPARK-45896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following action fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed 
> to encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -1), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
> unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
> scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
> AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external 
> type for schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
>  Source)
> ...
> {noformat}
> However, it succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
> {noformat}
> Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed 
> to encode a value of the expressions: 
> externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
> ObjectType(class java.lang.Object), false, -1), 
> assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
> ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
> lambdavariable(ExternalMapToCatalyst_value, ObjectType(class 
> java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -3), 
> assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -3), IntegerType, IntegerType)), 
> unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
> validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
> ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
> ObjectType(class scala.Option))), None), input[0, 
> scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external 
> type for schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
>  Source)
> ...
> {noformat}
> As with the first example, this succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq]

2023-11-11 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45896:
--
Description: 
The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: 
externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), 
assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), 
true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), 
assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
ObjectType(class scala.Option))), None), input[0, 
scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
As with the first example, this succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
{noformat}

  was:
The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of option of sequence also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 

[jira] [Created] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq]

2023-11-11 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-45896:
-

 Summary: Expression encoding fails for Seq/Map of Option[Seq]
 Key: SPARK-45896
 URL: https://issues.apache.org/jira/browse/SPARK-45896
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.1
Reporter: Bruce Robbins


The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of option of sequence also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: 
externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), 
assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), 
true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), 
assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
ObjectType(class scala.Option))), None), input[0, 
scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
As with the first example, this succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45894) hive table level setting hadoop.mapred.max.split.size

2023-11-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45894:
---
Labels: pull-request-available  (was: )

> hive table level setting hadoop.mapred.max.split.size
> -
>
> Key: SPARK-45894
> URL: https://issues.apache.org/jira/browse/SPARK-45894
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: guihuawen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In a Hive table scan, configuring the hadoop.mapred.max.split.size parameter 
> increases the parallelism of the scan stage, thereby reducing the running 
> time.
> However, when a large table and a small table appear in the same query, a 
> single global hadoop.mapred.max.split.size setting makes some stages run a 
> very large number of tasks while others run very few. To keep this balanced, 
> the hadoop.mapred.max.split.size parameter should be settable separately for 
> each Hive table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45893) Support drop multiple partitions in batch for hive

2023-11-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45893:
---
Labels: pull-request-available  (was: )

> Support drop multiple partitions in batch for hive
> --
>
> Key: SPARK-45893
> URL: https://issues.apache.org/jira/browse/SPARK-45893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Wechar
>Priority: Major
>  Labels: pull-request-available
>
> Support dropping partitions in batch to improve performance.
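>  
> For context, Spark SQL already accepts multiple partition specs in one 
> statement (sketch below); the improvement is to drop them via batched 
> metastore calls instead of one call per partition:
> {code:scala}
> spark.sql(
>   "alter table t drop if exists partition (dt='2023-11-10'), partition (dt='2023-11-11')")
> {code}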



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45895) Combine multiple like to like all

2023-11-11 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-45895:
---

 Summary: Combine multiple like to like all
 Key: SPARK-45895
 URL: https://issues.apache.org/jira/browse/SPARK-45895
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang



{code:scala}
spark.sql("create table t(a string, b string, c string) using parquet")
spark.sql(
  """
    |select * from t where
    |substr(a, 1, 5) like '%a%' and
    |substr(a, 1, 5) like '%b%'
    |""".stripMargin).explain(true)
{code}

We can optimize the query to:
{code:scala}
spark.sql(
  """
    |select * from t where
    |substr(a, 1, 5) like all('%a%', '%b%')
    |""".stripMargin).explain(true)
{code}
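
LIKE ALL is true only when the input matches every pattern in the list, so the 
rewrite preserves the original semantics, and a single LikeAll predicate avoids 
building one Like expression per pattern. A quick sanity check:

{code:scala}
// 'abcde' matches both '%a%' and '%b%', so LIKE ALL returns true
spark.sql("select 'abcde' like all('%a%', '%b%')").show()
{code}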





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45894) hive table level setting hadoop.mapred.max.split.size

2023-11-11 Thread guihuawen (Jira)
guihuawen created SPARK-45894:
-

 Summary: hive table level setting hadoop.mapred.max.split.size
 Key: SPARK-45894
 URL: https://issues.apache.org/jira/browse/SPARK-45894
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: guihuawen
 Fix For: 3.5.0


In a Hive table scan, configuring the hadoop.mapred.max.split.size parameter 
increases the parallelism of the scan stage, thereby reducing the running 
time.

However, when a large table and a small table appear in the same query, a 
single global hadoop.mapred.max.split.size setting makes some stages run a 
very large number of tasks while others run very few. To keep this balanced, 
the hadoop.mapred.max.split.size parameter should be settable separately for 
each Hive table.
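
For reference, today the split size can only be set globally for a session (a 
minimal sketch; mapred.max.split.size is the underlying Hadoop key, and the 
per-table override shown in the comment is the proposal, not an existing API):

{code:scala}
// current behaviour: one global split size applied to every Hive table scan
spark.sparkContext.hadoopConfiguration
  .set("mapred.max.split.size", (32 * 1024 * 1024).toString)

// proposed (hypothetical): a per-table override, e.g. via a table property
// ALTER TABLE big_table SET TBLPROPERTIES ('hadoop.mapred.max.split.size'='33554432')
{code}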



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45522) Migrate jetty 9 to jetty 12

2023-11-11 Thread HiuFung (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17785155#comment-17785155
 ] 

HiuFung commented on SPARK-45522:
-

Pushed a WIP branch. The 9-to-10 bump is straightforward, but 10-to-12 
requires significant changes.

> Migrate jetty 9 to jetty 12
> ---
>
> Key: SPARK-45522
> URL: https://issues.apache.org/jira/browse/SPARK-45522
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>
> Jetty 12 supports Jakarta EE 8, EE 9, and EE 10 simultaneously. But the 
> version span is quite large and the documentation needs a detailed read; it 
> is not certain this can be completed within the 4.0 cycle, so the ticket is 
> set to low priority.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45522) Migrate jetty 9 to jetty 12

2023-11-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45522:
---
Labels: pull-request-available  (was: )

> Migrate jetty 9 to jetty 12
> ---
>
> Key: SPARK-45522
> URL: https://issues.apache.org/jira/browse/SPARK-45522
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>
> Jetty 12 supports Jakarta EE 8, EE 9, and EE 10 simultaneously. But the 
> version span is quite large and the documentation needs a detailed read; it 
> is not certain this can be completed within the 4.0 cycle, so the ticket is 
> set to low priority.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org