[jira] [Resolved] (SPARK-45655) current_date() not supported in Streaming Query Observed metrics
[ https://issues.apache.org/jira/browse/SPARK-45655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-45655.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 43517
[https://github.com/apache/spark/pull/43517]

> current_date() not supported in Streaming Query Observed metrics
> -----------------------------------------------------------------
>
>                 Key: SPARK-45655
>                 URL: https://issues.apache.org/jira/browse/SPARK-45655
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.4.1, 3.5.0
>            Reporter: Bhuwan Sahni
>            Assignee: Bhuwan Sahni
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Streaming queries do not support current_date() inside CollectMetrics. The primary reason is that current_date() (which resolves to CurrentBatchTimestamp) is marked as non-deterministic. However, {{current_date}} and {{current_timestamp}} are both deterministic today, and {{current_batch_timestamp}} should be the same.
>
> As an example, the query below fails due to the observe call on the DataFrame.
>
> {quote}val inputData = MemoryStream[Timestamp]
> inputData.toDF()
>   .filter("value < current_date()")
>   .observe("metrics", count(expr("value >= current_date()")).alias("dropped"))
>   .writeStream
>   .queryName("ts_metrics_test")
>   .format("memory")
>   .outputMode("append")
>   .start()
> {quote}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
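With the fix in place, the metrics defined via observe() should be readable from the streaming query's progress. A minimal sketch of how one might verify this, using the public StreamingQuery API (it assumes `spark` is an active SparkSession and that the query named "ts_metrics_test" above has already processed at least one batch, since lastProgress is null before that):

```scala
// Sketch: read the "dropped" counter declared by the observe() call above.
val query = spark.streams.active.find(_.name == "ts_metrics_test").get

// observedMetrics maps each observe() name to a Row of its aggregate values.
val metrics = query.lastProgress.observedMetrics
if (metrics.containsKey("metrics")) {
  val dropped = metrics.get("metrics").getAs[Long]("dropped")
  println(s"rows at or after current_date(): $dropped")
}
```

A StreamingQueryListener's onQueryProgress callback is the usual production alternative to polling lastProgress.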
[jira] [Assigned] (SPARK-45655) current_date() not supported in Streaming Query Observed metrics
[ https://issues.apache.org/jira/browse/SPARK-45655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim reassigned SPARK-45655:
------------------------------------
    Assignee: Bhuwan Sahni

> current_date() not supported in Streaming Query Observed metrics
> -----------------------------------------------------------------
>
>                 Key: SPARK-45655
>                 URL: https://issues.apache.org/jira/browse/SPARK-45655
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.4.1, 3.5.0
>            Reporter: Bhuwan Sahni
>            Assignee: Bhuwan Sahni
>            Priority: Major
>              Labels: pull-request-available
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Streaming queries do not support current_date() inside CollectMetrics. The primary reason is that current_date() (which resolves to CurrentBatchTimestamp) is marked as non-deterministic. However, {{current_date}} and {{current_timestamp}} are both deterministic today, and {{current_batch_timestamp}} should be the same.
>
> As an example, the query below fails due to the observe call on the DataFrame.
>
> {quote}val inputData = MemoryStream[Timestamp]
> inputData.toDF()
>   .filter("value < current_date()")
>   .observe("metrics", count(expr("value >= current_date()")).alias("dropped"))
>   .writeStream
>   .queryName("ts_metrics_test")
>   .format("memory")
>   .outputMode("append")
>   .start()
> {quote}
[jira] [Updated] (SPARK-45894) hive table level setting hadoop.mapred.max.split.size
[ https://issues.apache.org/jira/browse/SPARK-45894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45894:
--------------------------------
    Target Version/s:   (was: 3.5.0)

> hive table level setting hadoop.mapred.max.split.size
> ------------------------------------------------------
>
>                 Key: SPARK-45894
>                 URL: https://issues.apache.org/jira/browse/SPARK-45894
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: guihuawen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
> In a Hive table scan, configuring the hadoop.mapred.max.split.size parameter increases the parallelism of the scan stage and thereby reduces the running time.
> However, when a large table and a small table appear in the same query, a single global hadoop.mapred.max.split.size value makes some stages run a very large number of tasks while other stages run very few. To keep this balanced, the hadoop.mapred.max.split.size parameter should be settable separately for each Hive table.
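One plausible shape for the proposed per-table override is to look for the split size in the table's own properties before falling back to the session-wide value. This is purely an illustrative sketch of the resolution order the ticket asks for, not the actual patch; the use of `hadoop.mapred.max.split.size` as a table property key and the helper function below are hypothetical:

```scala
// Hypothetical lookup: table-level property first, then session/Hadoop
// configuration, then a hard default.
def maxSplitSizeFor(tableProps: Map[String, String],
                    sessionConf: Map[String, String]): Long = {
  val key = "hadoop.mapred.max.split.size"
  tableProps.get(key)
    .orElse(sessionConf.get(key))
    .map(_.toLong)
    .getOrElse(256L * 1024 * 1024) // assumed 256 MB fallback
}
```

Under such a scheme, the small table in a mixed query could carry a smaller split size (yielding more scan tasks) while the large table keeps the global value.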
[jira] [Commented] (SPARK-45522) Migrate jetty 9 to jetty 12
[ https://issues.apache.org/jira/browse/SPARK-45522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785257#comment-17785257 ]

Yang Jie commented on SPARK-45522:
----------------------------------
[~HF] Based on your experience, is it possible to complete all the upgrade work before June 24th? Or should I change this ticket to target a Jetty 9-to-10 upgrade first?

> Migrate jetty 9 to jetty 12
> ---------------------------
>
>                 Key: SPARK-45522
>                 URL: https://issues.apache.org/jira/browse/SPARK-45522
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 4.0.0
>            Reporter: Yang Jie
>            Priority: Minor
>              Labels: pull-request-available
>
> Jetty 12 supports Jakarta EE 8, Jakarta EE 9, and Jakarta EE 10 simultaneously. But the version span is quite large and the documentation needs to be read in detail. It is not certain this can be completed within the 4.0 cycle, so it is set to low priority.
[jira] [Assigned] (SPARK-45871) Change `.toBuffer.toSeq` to `.toSeq`
[ https://issues.apache.org/jira/browse/SPARK-45871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Jie reassigned SPARK-45871:
--------------------------------
    Assignee: Yang Jie

> Change `.toBuffer.toSeq` to `.toSeq`
> -------------------------------------
>
>                 Key: SPARK-45871
>                 URL: https://issues.apache.org/jira/browse/SPARK-45871
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 4.0.0
>            Reporter: Yang Jie
>            Assignee: Yang Jie
>            Priority: Minor
>              Labels: pull-request-available
[jira] [Resolved] (SPARK-45871) Change `.toBuffer.toSeq` to `.toSeq`
[ https://issues.apache.org/jira/browse/SPARK-45871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Jie resolved SPARK-45871.
------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 43745
[https://github.com/apache/spark/pull/43745]

> Change `.toBuffer.toSeq` to `.toSeq`
> -------------------------------------
>
>                 Key: SPARK-45871
>                 URL: https://issues.apache.org/jira/browse/SPARK-45871
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 4.0.0
>            Reporter: Yang Jie
>            Assignee: Yang Jie
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.0.0
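The simplification behind this ticket is behavior-preserving: both expressions yield an immutable Seq with the same elements, but `.toSeq` builds it directly instead of first materializing a mutable ArrayBuffer and then copying it. A minimal illustration with plain Scala 2.13 collections (not Spark Connect code):

```scala
val elems = Iterator(1, 2, 3)

// Before: elems.toBuffer.toSeq
//   materializes a mutable ArrayBuffer, then copies it into an immutable Seq.
// After: build the immutable Seq in one step, skipping the intermediate buffer.
// (An iterator can only be consumed once, so only one form is run here.)
val seq = elems.toSeq

assert(seq == Seq(1, 2, 3))
```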
[jira] [Updated] (SPARK-43100) Mismatch of field name in log event writer and parser for push shuffle metrics
[ https://issues.apache.org/jira/browse/SPARK-43100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-43100:
-----------------------------------
    Labels: pull-request-available  (was: )

> Mismatch of field name in log event writer and parser for push shuffle metrics
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-43100
>                 URL: https://issues.apache.org/jira/browse/SPARK-43100
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: Ye Zhou
>            Priority: Major
>              Labels: pull-request-available
>
> For push-based shuffle metrics, when writing the event out to the log file, the field name is "Push Based Shuffle", in
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L548]
> But when parsing it back in the SHS, the expected field name is "Shuffle Push Read Metrics", as shown in
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L1264]
> This mismatch makes all the push shuffle metrics 0 in SHS REST calls.
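The failure mode is generic: when the writer and the parser of a JSON document disagree on a field name, the parser's lookup always misses and silently falls back to its default. A toy illustration of the mismatch described above (using a plain Map as a stand-in; this is not Spark's actual JsonProtocol code):

```scala
// Writer emits the metrics under one key...
val event = Map("Push Based Shuffle" -> 42L)

// ...but the parser looks them up under another key, so it only ever
// sees the default value.
val parsed = event.getOrElse("Shuffle Push Read Metrics", 0L)

assert(parsed == 0L) // the 42 written out is silently read back as 0
```

This is why a shared constant for the field name (used by both the write and read paths) is the usual defense against this class of bug.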
[jira] [Updated] (SPARK-44659) SPJ: Include keyGroupedPartitioning in StoragePartitionJoinParams equality check
[ https://issues.apache.org/jira/browse/SPARK-44659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-44659:
-----------------------------------
    Labels: pull-request-available  (was: )

> SPJ: Include keyGroupedPartitioning in StoragePartitionJoinParams equality check
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-44659
>                 URL: https://issues.apache.org/jira/browse/SPARK-44659
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Chao Sun
>            Priority: Minor
>              Labels: pull-request-available
>
> Currently {{StoragePartitionJoinParams}} doesn't include {{keyGroupedPartitioning}} in its {{equals}} and {{hashCode}} computation. For completeness, we should include it as well, since it is a member of the class.
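Why this matters: when a member is left out of equals/hashCode, two instances that differ only in that member compare equal, so any cache or reuse check keyed on the object can conflate genuinely different configurations. A minimal sketch with a hypothetical stand-in class (not the real StoragePartitionJoinParams):

```scala
// Stand-in with keyGroupedPartitioning deliberately excluded from equality.
class Params(val keyGroupedPartitioning: Option[Seq[String]]) {
  override def equals(other: Any): Boolean = other.isInstanceOf[Params] // field ignored
  override def hashCode(): Int = 0
}

val a = new Params(Some(Seq("bucket")))
val b = new Params(None)
assert(a == b) // distinct configurations wrongly treated as interchangeable

// Including every member is exactly what a case class gives for free:
case class FixedParams(keyGroupedPartitioning: Option[Seq[String]])
assert(FixedParams(Some(Seq("bucket"))) != FixedParams(None))
```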
[jira] [Updated] (SPARK-44514) Optimize join if maximum number of rows on one side is 1
[ https://issues.apache.org/jira/browse/SPARK-44514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-44514:
-----------------------------------
    Labels: pull-request-available  (was: )

> Optimize join if maximum number of rows on one side is 1
> ---------------------------------------------------------
>
>                 Key: SPARK-44514
>                 URL: https://issues.apache.org/jira/browse/SPARK-44514
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: pull-request-available
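The intuition behind this improvement: if the optimizer can prove one join side yields at most one row, a cross/inner join degenerates into pairing every row of the other side with that single row (or into an empty result if the row is absent), with no shuffle or nested-loop machinery needed. A plain-collections sketch of that equivalence, not the actual optimizer rule:

```scala
val left = Seq(1, 2, 3)
val singleRowSide = Seq(10) // optimizer knows: max rows = 1

// Generic nested-loop join...
val joined = for (l <- left; r <- singleRowSide) yield (l, r)

// ...is equivalent to projecting the lone row onto every left row, if present.
val optimized = singleRowSide.headOption.toSeq.flatMap(r => left.map(l => (l, r)))

assert(joined == optimized)
```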
[jira] [Commented] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
[ https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785234#comment-17785234 ]

Bruce Robbins commented on SPARK-45896:
---------------------------------------
I think I have a handle on this and will make a PR shortly.

> Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-45896
>                 URL: https://issues.apache.org/jira/browse/SPARK-45896
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1, 3.5.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> The following action fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array<int>
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source)
> ...
> {noformat}
> However, it succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]
>
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
> {noformat}
> Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array<int>
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source)
> ...
> {noformat}
> As with the first example, this succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]
>
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
> {noformat}
> Other cases that fail on 3.4.1, 3.5.0, and master but work fine on 3.3.3:
> - {{Seq[Option[Timestamp]]}}
> - {{Map[Option[Timestamp]]}}
> - {{Seq[Option[Date]]}}
> - {{Map[Option[Date]]}}
> - {{Seq[Option[BigDecimal]]}}
> - {{Map[Option[BigDecimal]]}}
> However, the following work fine on 3.3.3, 3.4.1, 3.5.0, and master:
> - {{Seq[Option[Map]]}}
> - {{Map[Option[Map]]}}
> - {{Seq[Option[]]}}
> - {{Map[Option[]]}}
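Until a fix lands, one workaround sketch is to unwrap the Option before creating the DataFrame, so the encoder never sees a scala.Some inside the nested collection. This is illustrative only and changes semantics slightly (None becomes an empty Seq rather than a null array element); it assumes a spark-shell session with `spark.implicits._` in scope:

```scala
// Workaround sketch: flatten Option[Seq[Int]] to Seq[Int] up front.
// Assumes: import spark.implicits._ (available by default in spark-shell).
val data = Seq(Seq(Some(Seq(0)), None))
val df = data.map(_.map(_.getOrElse(Seq.empty[Int]))).toDF("a")
```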
[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
[ https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-45896:
----------------------------------
    Description:

The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array<int>
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array<int>
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source)
...
{noformat}
As with the first example, this succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
{noformat}
Other cases that fail on 3.4.1, 3.5.0, and master but work fine on 3.3.3:
- {{Seq[Option[Timestamp]]}}
- {{Map[Option[Timestamp]]}}
- {{Seq[Option[Date]]}}
- {{Map[Option[Date]]}}
- {{Seq[Option[BigDecimal]]}}
- {{Map[Option[BigDecimal]]}}
However, the following work fine on 3.3.3, 3.4.1, 3.5.0, and master:
- {{Seq[Option[Map]]}}
- {{Map[Option[Map]]}}
- {{Seq[Option[]]}}
- {{Map[Option[]]}}
[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
[ https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-45896:
----------------------------------
    Summary: Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]  (was: Expression encoding fails for Seq/Map of Option[Seq])

> Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-45896
>                 URL: https://issues.apache.org/jira/browse/SPARK-45896
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1, 3.5.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> The following action fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array<int>
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source)
> ...
> {noformat}
> However, it succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]
>
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
> {noformat}
> Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array<int>
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source)
> ...
> {noformat}
> As with the first example, this succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]
>
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
> {noformat}
[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq]
[ https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-45896:
----------------------------------
    Description:

The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array<int>
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array<int>
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source)
...
{noformat}
As with the first example, this succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
{noformat}
[jira] [Created] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq]
Bruce Robbins created SPARK-45896: - Summary: Expression encoding fails for Seq/Map of Option[Seq] Key: SPARK-45896 URL: https://issues.apache.org/jira/browse/SPARK-45896 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0, 3.4.1 Reporter: Bruce Robbins The following action fails on 3.4.1, 3.5.0, and master: {noformat} scala> val df = Seq(Seq(Some(Seq(0.toDF("a") val df = Seq(Seq(Some(Seq(0.toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) AS value#0 to a row. SQLSTATE: 42846 ... Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source) ... 
{noformat} However, it succeeds on 3.3.3: {noformat} scala> val df = Seq(Seq(Some(Seq(0.toDF("a") df: org.apache.spark.sql.DataFrame = [a: array>] scala> df.collect res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))]) {noformat} Map of option of sequence also fails on 3.4.1, 3.5.0, and master: {noformat} scala> val df = Seq(Map(0 -> Some(Seq(0.toDF("a") val df = Seq(Map(0 -> Some(Seq(0.toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846 ... Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source) ... 
{noformat}

As with the first example, this succeeds on 3.3.3:

{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
{noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
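Until the fix lands, one possible workaround (a sketch of my own, not proposed in the ticket) is to strip the `Option` layer before creating the DataFrame, since the encoder handles plain nested `Seq`s on the affected versions. Note that `flatten` on a `Seq[Option[_]]` silently drops `None` entries rather than encoding them as nulls, so this is only valid when that behavior is acceptable:

```scala
// Hypothetical workaround, not from the ticket: unwrap the Option layer
// before calling toDF. Seq[Option[Seq[Int]]].flatten unwraps each Some
// and drops None entries entirely (a semantic change vs. null elements).
val raw: Seq[Seq[Option[Seq[Int]]]] = Seq(Seq(Some(Seq(0)), None))
val unwrapped: Seq[Seq[Seq[Int]]] = raw.map(_.flatten)
// `unwrapped` can now be passed to .toDF("a") on 3.4.x/3.5.x.
assert(unwrapped == Seq(Seq(Seq(0))))
```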
[jira] [Updated] (SPARK-45894) hive table level setting hadoop.mapred.max.split.size
[ https://issues.apache.org/jira/browse/SPARK-45894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45894:
-----------------------------------
    Labels: pull-request-available  (was: )

> hive table level setting hadoop.mapred.max.split.size
> -----------------------------------------------------
>
>                 Key: SPARK-45894
>                 URL: https://issues.apache.org/jira/browse/SPARK-45894
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: guihuawen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
> When scanning a Hive table, configuring the hadoop.mapred.max.split.size parameter increases the parallelism of the scan stage and thereby reduces running time.
> However, when a large table and a small table appear in the same query, a single global hadoop.mapred.max.split.size causes some stages to run a very large number of tasks while other stages run very few. Allowing hadoop.mapred.max.split.size to be set separately for each Hive table would keep the stages balanced.
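The imbalance described above can be illustrated with simple arithmetic. Under a simplified model (an assumption for illustration, not Hive's exact splitting logic), the task count of a scan stage is roughly the input size divided by the max split size, so one global value produces wildly different task counts for tables of different sizes:

```scala
// Simplified model (assumption, not Hive's exact InputFormat logic):
// taskCount ~= ceil(inputBytes / maxSplitSize).
def approxTasks(inputBytes: Long, maxSplitSize: Long): Long =
  (inputBytes + maxSplitSize - 1) / maxSplitSize

val split      = 32L * 1024 * 1024              // one global 32 MB max split size
val largeTable = 100L * 1024 * 1024 * 1024      // a 100 GB table
val smallTable = 64L * 1024 * 1024              // a 64 MB table

// The same global setting yields 3200 tasks for one stage and 2 for the
// other; a per-table setting would let each stage be tuned independently.
assert(approxTasks(largeTable, split) == 3200L)
assert(approxTasks(smallTable, split) == 2L)
```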
[jira] [Updated] (SPARK-45893) Support drop multiple partitions in batch for hive
[ https://issues.apache.org/jira/browse/SPARK-45893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45893:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support drop multiple partitions in batch for hive
> --------------------------------------------------
>
>                 Key: SPARK-45893
>                 URL: https://issues.apache.org/jira/browse/SPARK-45893
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Wechar
>            Priority: Major
>              Labels: pull-request-available
>
> Support dropping multiple partitions in a single batch call to improve performance.
[jira] [Created] (SPARK-45895) Combine multiple like to like all
Yuming Wang created SPARK-45895:
-----------------------------------

             Summary: Combine multiple like to like all
                 Key: SPARK-45895
                 URL: https://issues.apache.org/jira/browse/SPARK-45895
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang

{code:scala}
spark.sql("create table t(a string, b string, c string) using parquet")
spark.sql(
  """
    |select * from t where
    |substr(a, 1, 5) like '%a%' and
    |substr(a, 1, 5) like '%b%'
    |""".stripMargin).explain(true)
{code}

We can optimize the query to:

{code:scala}
spark.sql(
  """
    |select * from t where
    |substr(a, 1, 5) like all('%a%', '%b%')
    |""".stripMargin).explain(true)
{code}
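The proposed rewrite is sound because `x LIKE ALL(p1, p2)` is by definition the conjunction of the individual `LIKE` predicates. This can be sanity-checked outside Spark with a tiny model of LIKE semantics (`sqlLike` and `likeAll` below are hypothetical helpers written for this sketch, not Spark APIs):

```scala
// Minimal model of SQL LIKE: '%' matches any run of characters, '_' matches
// any single character, everything else matches literally.
def sqlLike(s: String, pattern: String): Boolean = {
  val regex = pattern.flatMap {
    case '%' => ".*"
    case '_' => "."
    case c   => java.util.regex.Pattern.quote(c.toString)
  }
  s.matches(regex)
}

// LIKE ALL is true iff every pattern in the list matches.
def likeAll(s: String, patterns: String*): Boolean = patterns.forall(sqlLike(s, _))

val v = "abcde".substring(0, 5)
// The combined predicate evaluates exactly like the original conjunction:
assert(likeAll(v, "%a%", "%b%") == (sqlLike(v, "%a%") && sqlLike(v, "%b%")))
```

Besides readability, folding the conjunction into a single `LikeAll` expression means `substr(a, 1, 5)` is referenced once instead of once per pattern, which is where the optimization opportunity lies.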
[jira] [Created] (SPARK-45894) hive table level setting hadoop.mapred.max.split.size
guihuawen created SPARK-45894:
---------------------------------

             Summary: hive table level setting hadoop.mapred.max.split.size
                 Key: SPARK-45894
                 URL: https://issues.apache.org/jira/browse/SPARK-45894
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: guihuawen
             Fix For: 3.5.0

When scanning a Hive table, configuring the hadoop.mapred.max.split.size parameter increases the parallelism of the scan stage and thereby reduces running time.

However, when a large table and a small table appear in the same query, a single global hadoop.mapred.max.split.size causes some stages to run a very large number of tasks while other stages run very few. Allowing hadoop.mapred.max.split.size to be set separately for each Hive table would keep the stages balanced.
[jira] [Commented] (SPARK-45522) Migrate jetty 9 to jetty 12
[ https://issues.apache.org/jira/browse/SPARK-45522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17785155#comment-17785155 ]

HiuFung commented on SPARK-45522:
---------------------------------

Pushed a WIP branch. The 9 to 10 bump is straightforward, but 10 to 12 requires significant changes.

> Migrate jetty 9 to jetty 12
> ---------------------------
>
>                 Key: SPARK-45522
>                 URL: https://issues.apache.org/jira/browse/SPARK-45522
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 4.0.0
>            Reporter: Yang Jie
>            Priority: Minor
>              Labels: pull-request-available
>
> Jetty 12 supports Jakarta EE 8, 9, and 10 simultaneously. But the version span is quite large and the documentation needs to be read in detail; it is unclear whether the migration can be completed within the 4.0 cycle, so it is set to low priority.
[jira] [Updated] (SPARK-45522) Migrate jetty 9 to jetty 12
[ https://issues.apache.org/jira/browse/SPARK-45522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45522:
-----------------------------------
    Labels: pull-request-available  (was: )

> Migrate jetty 9 to jetty 12
> ---------------------------
>
>                 Key: SPARK-45522
>                 URL: https://issues.apache.org/jira/browse/SPARK-45522
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 4.0.0
>            Reporter: Yang Jie
>            Priority: Minor
>              Labels: pull-request-available
>
> Jetty 12 supports Jakarta EE 8, 9, and 10 simultaneously. But the version span is quite large and the documentation needs to be read in detail; it is unclear whether the migration can be completed within the 4.0 cycle, so it is set to low priority.