[jira] [Resolved] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Zhuge resolved SPARK-45657.
--------------------------------
    Fix Version/s: 3.5.0
       Resolution: Fixed

The issue is fixed in 3.5.0.

> Caching SQL UNION of different column data types does not work inside Dataset.union
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-45657
>                 URL: https://issues.apache.org/jira/browse/SPARK-45657
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.2, 3.4.0, 3.4.1
>            Reporter: John Zhuge
>            Priority: Major
>             Fix For: 3.5.0
>
> Cache a SQL UNION whose two sides have different column data types:
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache() {code}
> Dataset.union does not leverage the cache:
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> A SQL UNION of the cached SQL UNION does use the cache; note the `InMemoryRelation` node:
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Zhuge updated SPARK-45657:
-------------------------------
    Affects Version/s: 3.4.1
                       3.4.0
[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif updated SPARK-45658:
-------------------------
    Description:
The canonicalization of buildKeys: Seq[Expression] in the class DynamicPruningSubquery is broken: the buildKeys are canonicalized just by calling buildKeys.map(_.canonicalized). This produces incorrect canonicalization because it does not normalize the exprIds relative to the buildQuery output. The fix is to use the output of buildQuery: LogicalPlan to normalize the buildKeys expressions, using the standard approach: buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)). Will be filing a PR and a bug test for the same.

  was:
The canonicalization of buildKeys: Seq[Expression] in the class DynamicPruningSubquery is broken: the buildKeys are canonicalized just by calling buildKeys.map(_.canonicalized). This produces incorrect canonicalization because it does not normalize the exprIds. The fix is to use the output of buildQuery: LogicalPlan to normalize the buildKeys expressions, using the standard approach: buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)). Will be filing a PR and a bug test for the same.
[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif updated SPARK-45658:
-------------------------
    Priority: Major  (was: Critical)
[jira] [Created] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
Asif created SPARK-45658:
-------------------------
             Summary: Canonicalization of DynamicPruningSubquery is broken
                 Key: SPARK-45658
                 URL: https://issues.apache.org/jira/browse/SPARK-45658
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0, 3.5.1
            Reporter: Asif

The canonicalization of buildKeys: Seq[Expression] in the class DynamicPruningSubquery is broken: the buildKeys are canonicalized just by calling buildKeys.map(_.canonicalized). This produces incorrect canonicalization because it does not normalize the exprIds. The fix is to use the output of buildQuery: LogicalPlan to normalize the buildKeys expressions, using the standard approach: buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)). Will be filing a PR and a bug test for the same.
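The failure mode described above can be illustrated without Spark. The sketch below is a toy model (the `Attr` type and helper names are hypothetical, not Catalyst classes): canonicalizing a key expression in isolation leaves its per-instance exprId in place, so two copies of the same build query never compare equal, while normalizing against the build query's output replaces the exprId with a stable ordinal, which is the idea behind `QueryPlan.normalizeExpressions`.

```scala
// Toy model of exprId normalization; Attr and the helpers are hypothetical
// stand-ins, not Catalyst classes.
object NormalizationSketch {
  final case class Attr(name: String, exprId: Long)

  // Canonicalization in isolation can only erase the name; the
  // per-plan-instance exprId survives, so structurally identical
  // subqueries still produce unequal canonical keys.
  def canonicalizedInIsolation(a: Attr): Attr = Attr("none", a.exprId)

  // Normalizing against the plan output rewrites the exprId to the
  // attribute's ordinal position in that output, which is stable
  // across plan instances.
  def normalizeAgainstOutput(a: Attr, output: Seq[Attr]): Attr =
    Attr("none", output.indexWhere(_.exprId == a.exprId).toLong)

  def main(args: Array[String]): Unit = {
    // Two instantiations of the same build query get fresh exprIds.
    val output1 = Seq(Attr("id", 100L))
    val output2 = Seq(Attr("id", 200L))

    // In isolation: unequal, so canonical plans spuriously differ.
    assert(canonicalizedInIsolation(output1.head) != canonicalizedInIsolation(output2.head))

    // Output-relative: both normalize to ordinal 0, so they compare equal.
    assert(normalizeAgainstOutput(output1.head, output1) ==
      normalizeAgainstOutput(output2.head, output2))
  }
}
```

Because canonicalization feeds plan equality checks (e.g. reuse of exchanges and subqueries), keys that spuriously compare unequal defeat those optimizations.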
[jira] [Comment Edited] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779281#comment-17779281 ]

John Zhuge edited comment on SPARK-45657 at 10/25/23 4:55 AM:
--------------------------------------------------------------

Root cause:

# A SQL UNION of two sides with different data types produces a Project over a Project on one side to cast the type. When this is cached, the Project over Project is preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}
# Dataset.union applies `CombineUnions`, which rewrites all unions in the tree. CombineUnions collapses the two Projects into one, so Dataset.union of the above plan with any plan will not be able to find a matching cached plan.
{code:java}
object CombineUnions extends Rule[LogicalPlan] {
  ...
  private def flattenUnion(union: Union, flattenDistinct: Boolean):
    ...
      case p1 @ Project(_, p2: Project)
          if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = false) &&
            !p1.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) &&
            !p2.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) =>
        val newProjectList = buildCleanedProjectList(p1.projectList, p2.projectList)
        stack.pushAll(Seq(p2.copy(projectList = newProjectList))){code}
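The collapse-then-miss mechanics of the root cause above can be sketched without Spark. The toy plan model below uses hypothetical types, not Catalyst's, and the collapse rule simply fuses the project lists rather than substituting aliases as the real `CombineUnions` does; the point is only that the cached entry keys on the two-level Project shape, so the fused plan no longer matches it.

```scala
// Toy plan model showing why collapsing Project-over-Project defeats a
// structural cache lookup. Types and the collapse rule are simplified
// stand-ins for Catalyst's LogicalPlan and CombineUnions.
object CollapseSketch {
  sealed trait Plan
  final case class Project(exprs: List[String], child: Plan) extends Plan
  case object OneRowRelation extends Plan

  // Simplified collapse: fuse nested project lists into one node.
  // (Real Catalyst substitutes inner aliases into the outer expressions.)
  def collapse(p: Plan): Plan = p match {
    case Project(outer, Project(inner, child)) => collapse(Project(outer ++ inner, child))
    case other => other
  }

  def main(args: Array[String]): Unit = {
    // Shape stored at cache() time: the cast Project over the literal Project.
    val cached = Project(List("cast(id#153 as string) AS id#155"),
      Project(List("1 AS id#153"), OneRowRelation))

    // Dataset.union runs the collapse eagerly; the shape changes, so a
    // lookup that compares plans structurally misses the cache entry.
    val rewritten = collapse(cached)
    assert(rewritten != cached)
    assert(rewritten == Project(
      List("cast(id#153 as string) AS id#155", "1 AS id#153"), OneRowRelation))
  }
}
```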
[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779326#comment-17779326 ]

John Zhuge commented on SPARK-45657:
------------------------------------

It is fixed in the main branch:
{code:java}
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
      /_/

Using Scala version 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.7)
Type in expressions to have them evaluated.
Type :help for more information.
23/10/24 21:30:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.86.29:4040
Spark context available as 'sc' (master = local[*], app id = local-1698208231783).
Spark session available as 'spark'.

scala> spark.sql("select 1 id union select 's2' id").cache()
val res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string]

scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")).queryExecution.optimizedPlan
val res1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Union false, false
:- InMemoryRelation [id#11], StorageLevel(disk, memory, deserialized, 1 replicas)
:     +- AdaptiveSparkPlan isFinalPlan=false
:        +- HashAggregate(keys=[id#2], functions=[], output=[id#2])
:           +- Exchange hashpartitioning(id#2, 200), ENSURE_REQUIREMENTS, [plan_id=30]
:              +- HashAggregate(keys=[id#2], functions=[], output=[id#2])
:                 +- Union
:                    :- Project [1 AS id#2]
:                    :  +- Scan OneRowRelation[]
:                    +- Project [s2 AS id#1]
:                       +- Scan OneRowRelation[]
+- Project [s3 AS s3#13]
   +- OneRowRelation {code}
[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779283#comment-17779283 ]

John Zhuge commented on SPARK-45657:
------------------------------------

Interesting, there is a warning in Dataset.union:
{code:java}
def union(other: Dataset[T]): Dataset[T] = withSetOperator {
  // This breaks caching, but it's usually ok because it addresses a very specific use case:
  // using union to union many files or partitions.
  CombineUnions(Union(logicalPlan, other.logicalPlan))
} {code}
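The warning quoted above is essentially an ordering problem, which can be sketched abstractly. In the toy pipeline below (hypothetical names, not Spark internals), the cache lookup matches plans structurally: the SQL path probes with the plan as it looked at `cache()` time, while the Dataset.union path applies the eager `CombineUnions` rewrite first, so the probe already has the collapsed shape and misses.

```scala
// Toy sketch of the ordering problem: a structural cache lookup only hits
// if the probe plan has the same shape as the plan stored at cache() time.
// Plans are modeled as strings; names are hypothetical, not Spark APIs.
object OrderingSketch {
  type Plan = String

  // Shape registered when .cache() was called on the SQL UNION.
  val cachedPlans: Set[Plan] = Set("Project(cast)/Project(1)/OneRowRelation")

  def lookupCache(p: Plan): Boolean = cachedPlans.contains(p)

  // Stand-in for CombineUnions' Project collapse.
  def combineUnions(p: Plan): Plan =
    p.replace("Project(cast)/Project(1)", "Project(cast,1)")

  def main(args: Array[String]): Unit = {
    val plan: Plan = "Project(cast)/Project(1)/OneRowRelation"

    // SQL path: the cache lookup happens before the optimizer rewrites, so it hits.
    assert(lookupCache(plan))

    // Dataset.union path: eager CombineUnions runs first, so the lookup misses.
    assert(!lookupCache(combineUnions(plan)))
  }
}
```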
[jira] [Updated] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Zhuge updated SPARK-45657:
-------------------------------
    Affects Version/s: 3.3.2
                           (was: 3.4.1)
[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779282#comment-17779282 ]

John Zhuge commented on SPARK-45657:
------------------------------------

Checking whether this is still an issue in the main branch.
[jira] [Comment Edited] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779281#comment-17779281 ]

John Zhuge edited comment on SPARK-45657 at 10/25/23 12:38 AM:
---------------------------------------------------------------

Root cause:

# A SQL UNION of two sides with different data types produces a Project over a Project on one side to cast the type. When this is cached, the Project over Project is preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}
# Dataset.union applies `CombineUnions`, which rewrites all unions in the tree. CombineUnions collapses the two Projects into one, so Dataset.union of the above plan with any plan will not find a matching cached plan.
{code:java}
object CombineUnions extends Rule[LogicalPlan] {
  ...
  private def flattenUnion(union: Union, flattenDistinct: Boolean):
    ...
      case p1 @ Project(_, p2: Project)
          if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = false) &&
            !p1.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) &&
            !p2.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) =>
        val newProjectList = buildCleanedProjectList(p1.projectList, p2.projectList)
        stack.pushAll(Seq(p2.copy(projectList = newProjectList))){code}
[jira] [Comment Edited] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779281#comment-17779281 ]

John Zhuge edited comment on SPARK-45657 at 10/25/23 12:36 AM:
---------------------------------------------------------------

Root cause:

# A SQL UNION of two sides with different data types produces a Project over a Project on one side to cast the type. When this is cached, the Project over Project is preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}
# Dataset.union applies `CombineUnions`, which rewrites all unions in the tree. CombineUnions collapses the two Projects into one, so Dataset.union of the above plan with any plan will not find a matching cached plan.
[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779281#comment-17779281 ] John Zhuge commented on SPARK-45657: Root cause: # SQL UNION of 2 sides with different data types produces a Project of Project on 1 side to cast the type. When this is cached, the Project of Project is preserved. {noformat} Distinct +- Union false, false :- Project [cast(id#153 as string) AS id#155] : +- Project [1 AS id#153] : +- OneRowRelation +- Project [s2 AS id#154] +- OneRowRelation{noformat} # Dataset.union applies `CombineUnions`, which applies to all unions in the tree. CombineUnions collapses the 2 Projects into 1. Thus Dataset.union of the above plan with any plan will not find a matching cached plan. > Caching SQL UNION of different column data types does not work inside > Dataset.union > --- > > Key: SPARK-45657 > URL: https://issues.apache.org/jira/browse/SPARK-45657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: John Zhuge >Priority: Major > > > Cache SQL UNION of 2 sides with different column data types > {code:java} > scala> spark.sql("select 1 id union select 's2' id").cache() {code} > Dataset.union does not leverage the cache > {code:java} > scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select > 's3'")).queryExecution.optimizedPlan > res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Union false, false > :- Aggregate [id#109], [id#109] > : +- Union false, false > : :- Project [1 AS id#109] > : : +- OneRowRelation > : +- Project [s2 AS id#108] > : +- OneRowRelation > +- Project [s3 AS s3#111] > +- OneRowRelation {code} > SQL UNION of the cached SQL UNION does use the cache! Please note > `InMemoryRelation` used. 
> {code:java} > scala> spark.sql("(select 1 id union select 's2' id) union select > 's3'").queryExecution.optimizedPlan > res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Aggregate [id#117], [id#117] > +- Union false, false > :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 > replicas) > : +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100]) > : +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, > [plan_id=241] > : +- *(3) HashAggregate(keys=[id#100], functions=[], > output=[id#100]) > : +- Union > : :- *(1) Project [1 AS id#100] > : : +- *(1) Scan OneRowRelation[] > : +- *(2) Project [s2 AS id#99] > : +- *(2) Scan OneRowRelation[] > +- Project [s3 AS s3#116] > +- OneRowRelation {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45656) Fix observation when named observations with the same name on different datasets.
[ https://issues.apache.org/jira/browse/SPARK-45656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45656: --- Labels: pull-request-available (was: ) > Fix observation when named observations with the same name on different > datasets. > - > > Key: SPARK-45656 > URL: https://issues.apache.org/jira/browse/SPARK-45656 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0, 4.0.0 >Reporter: Takuya Ueshin >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
John Zhuge created SPARK-45657: -- Summary: Caching SQL UNION of different column data types does not work inside Dataset.union Key: SPARK-45657 URL: https://issues.apache.org/jira/browse/SPARK-45657 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1 Reporter: John Zhuge Cache SQL UNION of 2 sides with different column data types {code:java} scala> spark.sql("select 1 id union select 's2' id").cache() {code} Dataset.union does not leverage the cache {code:java} scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")).queryExecution.optimizedPlan res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Union false, false :- Aggregate [id#109], [id#109] : +- Union false, false : :- Project [1 AS id#109] : : +- OneRowRelation : +- Project [s2 AS id#108] : +- OneRowRelation +- Project [s3 AS s3#111] +- OneRowRelation {code} SQL UNION of the cached SQL UNION does use the cache! Please note `InMemoryRelation` used. {code:java} scala> spark.sql("(select 1 id union select 's2' id) union select 's3'").queryExecution.optimizedPlan res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Aggregate [id#117], [id#117] +- Union false, false :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 replicas) : +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100]) : +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, [plan_id=241] : +- *(3) HashAggregate(keys=[id#100], functions=[], output=[id#100]) : +- Union : :- *(1) Project [1 AS id#100] : : +- *(1) Scan OneRowRelation[] : +- *(2) Project [s2 AS id#99] : +- *(2) Scan OneRowRelation[] +- Project [s3 AS s3#116] +- OneRowRelation {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45656) Fix observation when named observations with the same name on different datasets.
Takuya Ueshin created SPARK-45656: - Summary: Fix observation when named observations with the same name on different datasets. Key: SPARK-45656 URL: https://issues.apache.org/jira/browse/SPARK-45656 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0, 4.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45648) Add sql/api and common/utils to modules.py
[ https://issues.apache.org/jira/browse/SPARK-45648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45648: Assignee: Ruifeng Zheng > Add sql/api and common/utils to modules.py > -- > > Key: SPARK-45648 > URL: https://issues.apache.org/jira/browse/SPARK-45648 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45648) Add sql/api and common/utils to modules.py
[ https://issues.apache.org/jira/browse/SPARK-45648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45648. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43501 [https://github.com/apache/spark/pull/43501] > Add sql/api and common/utils to modules.py > -- > > Key: SPARK-45648 > URL: https://issues.apache.org/jira/browse/SPARK-45648 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45648) Add sql/api and common/utils to modules.py
[ https://issues.apache.org/jira/browse/SPARK-45648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45648: --- Labels: pull-request-available (was: ) > Add sql/api and common/utils to modules.py > -- > > Key: SPARK-45648 > URL: https://issues.apache.org/jira/browse/SPARK-45648 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45622) java -target should use java.version instead of 17
[ https://issues.apache.org/jira/browse/SPARK-45622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45622. -- Resolution: Invalid > java -target should use java.version instead of 17 > -- > > Key: SPARK-45622 > URL: https://issues.apache.org/jira/browse/SPARK-45622 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Zhongwei Zhu >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45651) Snapshots of some packages are not published any more
[ https://issues.apache.org/jira/browse/SPARK-45651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45651: Assignee: Enrico Minack > Snapshots of some packages are not published any more > - > > Key: SPARK-45651 > URL: https://issues.apache.org/jira/browse/SPARK-45651 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 4.0.0 >Reporter: Enrico Minack >Assignee: Enrico Minack >Priority: Major > Labels: pull-request-available > > Snapshots of some packages are not being published anymore, e.g. > spark-sql_2.13-4.0.0 has not been published since Sep 13th: > https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.13/4.0.0-SNAPSHOT/ > There have been some attempts to fix CI: SPARK-45535 SPARK-45536 > The assumption is that memory consumption during the build exceeds the available > memory of the GitHub host. > The following could be attempted: > - enable manual trigger of the {{publish_snapshots.yml}} workflow > - enable some memory use logging to prove that exceeded memory is the root > cause > - attempt to reduce memory footprint and see the impact in the above logging > - revert memory use logging -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45651) Snapshots of some packages are not published any more
[ https://issues.apache.org/jira/browse/SPARK-45651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45651. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43512 [https://github.com/apache/spark/pull/43512] > Snapshots of some packages are not published any more > - > > Key: SPARK-45651 > URL: https://issues.apache.org/jira/browse/SPARK-45651 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 4.0.0 >Reporter: Enrico Minack >Assignee: Enrico Minack >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Snapshots of some packages are not being published anymore, e.g. > spark-sql_2.13-4.0.0 has not been published since Sep 13th: > https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.13/4.0.0-SNAPSHOT/ > There have been some attempts to fix CI: SPARK-45535 SPARK-45536 > The assumption is that memory consumption during the build exceeds the available > memory of the GitHub host. > The following could be attempted: > - enable manual trigger of the {{publish_snapshots.yml}} workflow > - enable some memory use logging to prove that exceeded memory is the root > cause > - attempt to reduce memory footprint and see the impact in the above logging > - revert memory use logging -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45640) Fix flaky ProtobufCatalystDataConversionSuite
[ https://issues.apache.org/jira/browse/SPARK-45640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45640: Assignee: BingKun Pan > Fix flaky ProtobufCatalystDataConversionSuite > - > > Key: SPARK-45640 > URL: https://issues.apache.org/jira/browse/SPARK-45640 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45640) Fix flaky ProtobufCatalystDataConversionSuite
[ https://issues.apache.org/jira/browse/SPARK-45640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45640. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43493 [https://github.com/apache/spark/pull/43493] > Fix flaky ProtobufCatalystDataConversionSuite > - > > Key: SPARK-45640 > URL: https://issues.apache.org/jira/browse/SPARK-45640 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45655) current_date() not supported in Streaming Query Observed metrics
[ https://issues.apache.org/jira/browse/SPARK-45655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779264#comment-17779264 ] Bhuwan Sahni commented on SPARK-45655: -- PR link https://github.com/apache/spark/pull/43517 > current_date() not supported in Streaming Query Observed metrics > > > Key: SPARK-45655 > URL: https://issues.apache.org/jira/browse/SPARK-45655 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.4.1, 3.5.0 >Reporter: Bhuwan Sahni >Priority: Major > Labels: pull-request-available > Original Estimate: 48h > Remaining Estimate: 48h > > Streaming queries do not support current_date() inside CollectMetrics. The > primary reason is that current_date() (resolves to CurrentBatchTimestamp) is > marked as non-deterministic. However, {{current_date}} and > {{current_timestamp}} are both deterministic today, and > {{current_batch_timestamp}} should be the same. > > As an example, the query below fails due to observe call on the DataFrame. > > {quote}val inputData = MemoryStream[Timestamp] > inputData.toDF() > .filter("value < current_date()") > .observe("metrics", count(expr("value >= > current_date()")).alias("dropped")) > .writeStream > .queryName("ts_metrics_test") > .format("memory") > .outputMode("append") > .start() > {quote} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45655) current_date() not supported in Streaming Query Observed metrics
[ https://issues.apache.org/jira/browse/SPARK-45655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45655: --- Labels: pull-request-available (was: ) > current_date() not supported in Streaming Query Observed metrics > > > Key: SPARK-45655 > URL: https://issues.apache.org/jira/browse/SPARK-45655 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.4.1, 3.5.0 >Reporter: Bhuwan Sahni >Priority: Major > Labels: pull-request-available > Original Estimate: 48h > Remaining Estimate: 48h > > Streaming queries do not support current_date() inside CollectMetrics. The > primary reason is that current_date() (resolves to CurrentBatchTimestamp) is > marked as non-deterministic. However, {{current_date}} and > {{current_timestamp}} are both deterministic today, and > {{current_batch_timestamp}} should be the same. > > As an example, the query below fails due to observe call on the DataFrame. > > {quote}val inputData = MemoryStream[Timestamp] > inputData.toDF() > .filter("value < current_date()") > .observe("metrics", count(expr("value >= > current_date()")).alias("dropped")) > .writeStream > .queryName("ts_metrics_test") > .format("memory") > .outputMode("append") > .start() > {quote} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45654) Add Python data source write API
[ https://issues.apache.org/jira/browse/SPARK-45654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45654: --- Labels: pull-request-available (was: ) > Add Python data source write API > > > Key: SPARK-45654 > URL: https://issues.apache.org/jira/browse/SPARK-45654 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > Add Python data source write API in datasource.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45655) current_date() not supported in Streaming Query Observed metrics
Bhuwan Sahni created SPARK-45655: Summary: current_date() not supported in Streaming Query Observed metrics Key: SPARK-45655 URL: https://issues.apache.org/jira/browse/SPARK-45655 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.5.0, 3.4.1 Reporter: Bhuwan Sahni Streaming queries do not support current_date() inside CollectMetrics. The primary reason is that current_date() (which resolves to CurrentBatchTimestamp) is marked as non-deterministic. However, {{current_date}} and {{current_timestamp}} are both deterministic today, and {{current_batch_timestamp}} should be the same. As an example, the query below fails due to the observe call on the DataFrame. {quote}val inputData = MemoryStream[Timestamp] inputData.toDF() .filter("value < current_date()") .observe("metrics", count(expr("value >= current_date()")).alias("dropped")) .writeStream .queryName("ts_metrics_test") .format("memory") .outputMode("append") .start() {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
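The failure described above comes from an analysis-time check that rejects non-deterministic expressions inside observed metrics. A minimal Python model of that check (the `Expr` class, flag names, and `validate_observed_metrics` are illustrative assumptions, not Spark's actual analyzer code) shows why flipping the determinism flag on the batch timestamp would make the query pass:

```python
# Simplified, hypothetical model of the analyzer check behind this bug:
# observed metrics reject any expression whose deterministic flag is False.
# In Spark, current_date() in a stream resolves to CurrentBatchTimestamp,
# which today carries that flag even though its value is fixed per batch.
from dataclasses import dataclass


@dataclass
class Expr:
    name: str
    deterministic: bool
    children: tuple = ()


def validate_observed_metrics(expr: Expr) -> None:
    """Walk the expression tree; fail on any non-deterministic node."""
    if not expr.deterministic:
        raise ValueError(f"non-deterministic expression {expr.name} "
                         "is not allowed in observed metrics")
    for child in expr.children:
        validate_observed_metrics(child)


# Today: the batch timestamp is flagged non-deterministic -> query fails.
current_batch_ts = Expr("current_batch_timestamp", deterministic=False)
count_expr = Expr("count", True,
                  (Expr(">=", True, (Expr("value", True), current_batch_ts)),))
try:
    validate_observed_metrics(count_expr)
    failed = False
except ValueError:
    failed = True
assert failed

# Marking it deterministic (its value is constant within a batch) lets the
# same metric expression pass validation.
fixed_ts = Expr("current_batch_timestamp", deterministic=True)
fixed_count = Expr("count", True,
                   (Expr(">=", True, (Expr("value", True), fixed_ts)),))
validate_observed_metrics(fixed_count)   # no exception raised
```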
[jira] [Commented] (SPARK-45655) current_date() not supported in Streaming Query Observed metrics
[ https://issues.apache.org/jira/browse/SPARK-45655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779239#comment-17779239 ] Bhuwan Sahni commented on SPARK-45655: -- I am working on a fix for this issue, and will submit a PR soon. > current_date() not supported in Streaming Query Observed metrics > > > Key: SPARK-45655 > URL: https://issues.apache.org/jira/browse/SPARK-45655 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.4.1, 3.5.0 >Reporter: Bhuwan Sahni >Priority: Major > Original Estimate: 48h > Remaining Estimate: 48h > > Streaming queries do not support current_date() inside CollectMetrics. The > primary reason is that current_date() (resolves to CurrentBatchTimestamp) is > marked as non-deterministic. However, {{current_date}} and > {{current_timestamp}} are both deterministic today, and > {{current_batch_timestamp}} should be the same. > > As an example, the query below fails due to observe call on the DataFrame. > > {quote}val inputData = MemoryStream[Timestamp] > inputData.toDF() > .filter("value < current_date()") > .observe("metrics", count(expr("value >= > current_date()")).alias("dropped")) > .writeStream > .queryName("ts_metrics_test") > .format("memory") > .outputMode("append") > .start() > {quote} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45654) Add Python data source write API
Allison Wang created SPARK-45654: Summary: Add Python data source write API Key: SPARK-45654 URL: https://issues.apache.org/jira/browse/SPARK-45654 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add Python data source write API in datasource.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45503) RocksDB State Store to Use LZ4 Compression
[ https://issues.apache.org/jira/browse/SPARK-45503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-45503. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43338 [https://github.com/apache/spark/pull/43338] > RocksDB State Store to Use LZ4 Compression > -- > > Key: SPARK-45503 > URL: https://issues.apache.org/jira/browse/SPARK-45503 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.1 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > LZ4 is generally faster than Snappy. That's probably we use LZ4 in changelogs > and other places by default. However, we don't change RocksDB's default of > Snappy compression style. The RocksDB Team recommend LZ4 or ZSTD and the > default is kept to Snappy only for backward compatible reason. We should use > LZ4 instead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45653) Refractor XMLSuite to allow other test suites to easily extend and override.
[ https://issues.apache.org/jira/browse/SPARK-45653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shujing Yang resolved SPARK-45653. -- Resolution: Not A Problem > Refractor XMLSuite to allow other test suites to easily extend and override. > > > Key: SPARK-45653 > URL: https://issues.apache.org/jira/browse/SPARK-45653 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Shujing Yang >Priority: Major > Labels: pull-request-available > > Refactor XmlSuite to integrate dataframe readers, allowing other test suites > to easily extend and override. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45653) Refractor XMLSuite to allow other test suites to easily extend and override.
[ https://issues.apache.org/jira/browse/SPARK-45653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45653: --- Labels: pull-request-available (was: ) > Refractor XMLSuite to allow other test suites to easily extend and override. > > > Key: SPARK-45653 > URL: https://issues.apache.org/jira/browse/SPARK-45653 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Shujing Yang >Priority: Major > Labels: pull-request-available > > Refactor XmlSuite to integrate dataframe readers, allowing other test suites > to easily extend and override. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45524) Initial support for Python data source read API
[ https://issues.apache.org/jira/browse/SPARK-45524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-45524. --- Fix Version/s: 4.0.0 Assignee: Allison Wang Resolution: Fixed Issue resolved by pull request 43360 https://github.com/apache/spark/pull/43360 > Initial support for Python data source read API > --- > > Key: SPARK-45524 > URL: https://issues.apache.org/jira/browse/SPARK-45524 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Add API for data source and data source reader and add Catalyst + execution > support. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45653) Refractor XMLSuite to allow other test suites to easily extend and override.
[ https://issues.apache.org/jira/browse/SPARK-45653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shujing Yang updated SPARK-45653: - Summary: Refractor XMLSuite to allow other test suites to easily extend and override. (was: Refractor XMLSuite) > Refractor XMLSuite to allow other test suites to easily extend and override. > > > Key: SPARK-45653 > URL: https://issues.apache.org/jira/browse/SPARK-45653 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Shujing Yang >Priority: Major > > Refactor XmlSuite to integrate dataframe readers, allowing other test suites > to easily extend and override. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45653) Refractor XMLSuite
Shujing Yang created SPARK-45653: Summary: Refractor XMLSuite Key: SPARK-45653 URL: https://issues.apache.org/jira/browse/SPARK-45653 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Shujing Yang Refactor XmlSuite to integrate dataframe readers, allowing other test suites to easily extend and override. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45643) Replace `s.c.mutable.MapOps#transform` with `s.c.mutable.MapOps#mapValuesInPlace`
[ https://issues.apache.org/jira/browse/SPARK-45643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45643: - Assignee: Yang Jie > Replace `s.c.mutable.MapOps#transform` with > `s.c.mutable.MapOps#mapValuesInPlace` > - > > Key: SPARK-45643 > URL: https://issues.apache.org/jira/browse/SPARK-45643 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > > {code:java} > @deprecated("Use mapValuesInPlace instead", "2.13.0") > @inline final def transform(f: (K, V) => V): this.type = mapValuesInPlace(f) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45643) Replace `s.c.mutable.MapOps#transform` with `s.c.mutable.MapOps#mapValuesInPlace`
[ https://issues.apache.org/jira/browse/SPARK-45643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45643. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43500 [https://github.com/apache/spark/pull/43500] > Replace `s.c.mutable.MapOps#transform` with > `s.c.mutable.MapOps#mapValuesInPlace` > - > > Key: SPARK-45643 > URL: https://issues.apache.org/jira/browse/SPARK-45643 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {code:java} > @deprecated("Use mapValuesInPlace instead", "2.13.0") > @inline final def transform(f: (K, V) => V): this.type = mapValuesInPlace(f) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45651) Snapshots of some packages are not published any more
[ https://issues.apache.org/jira/browse/SPARK-45651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45651: --- Labels: pull-request-available (was: ) > Snapshots of some packages are not published any more > - > > Key: SPARK-45651 > URL: https://issues.apache.org/jira/browse/SPARK-45651 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 4.0.0 >Reporter: Enrico Minack >Priority: Major > Labels: pull-request-available > > Snapshots of some packages are not been published anymore, e.g. > spark-sql_2.13-4.0.0 has not been published since Sep, 13th: > https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.13/4.0.0-SNAPSHOT/ > There have been some attempts to fix CI: SPARK-45535 SPARK-45536 > Assumption is that memory consumption during build exceeds the available > memory of the Github host. > The following could be attempted: > - enable manual trigger of the {{publish_snapshots.yml}} workflow > - enable some memory use logging to proof that exceeded memory is the root > cause > - attempt to reduce memory footprint and see impact in above logging > - revert memory use logging -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45652) SPJ: Handle empty input partitions after dynamic filtering
Chao Sun created SPARK-45652: Summary: SPJ: Handle empty input partitions after dynamic filtering Key: SPARK-45652 URL: https://issues.apache.org/jira/browse/SPARK-45652 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.1 Reporter: Chao Sun When the number of input partitions becomes 0 after dynamic filtering in {{BatchScanExec}}, SPJ currently fails with this error: {code} java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135) at org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD$lzycompute(BosonBatchScanExec.scala:28) at org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD(BosonBatchScanExec.scala:28) at org.apache.spark.sql.boson.BosonBatchScanExec.doExecuteColumnar(BosonBatchScanExec.scala:33) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:218) at org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:521) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) {code} This is because {{groupPartitions}} will return {{None}} for this case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
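The control flow behind the report above can be sketched in a few lines: when every partition is pruned by dynamic filtering, the grouping step yields nothing, and unconditionally unwrapping that missing value is what throws. This is a minimal Python sketch, not Spark's actual Scala internals; the function names only mirror `groupPartitions` and `filteredPartitions` for illustration.

```python
# Hypothetical sketch of the failing vs. fixed code path (names illustrative,
# not Spark's real API).

def group_partitions(parts):
    """Stand-in for BatchScanExec.groupPartitions: returns None when empty."""
    return sorted(parts) if parts else None

def filtered_partitions_unsafe(parts):
    # Mirrors the failing path: assumes a group always exists (like None.get).
    groups = group_partitions(parts)
    if groups is None:
        raise LookupError("None.get")  # analogous to NoSuchElementException
    return groups

def filtered_partitions_safe(parts):
    # Treat "no groups after dynamic filtering" as "no partitions to scan".
    groups = group_partitions(parts)
    return groups if groups is not None else []
```

With this guard, an empty input simply produces an empty scan instead of crashing the task.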
[jira] [Updated] (SPARK-44405) Reduce code duplication in group-based DELETE and MERGE tests
[ https://issues.apache.org/jira/browse/SPARK-44405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44405: --- Labels: pull-request-available (was: ) > Reduce code duplication in group-based DELETE and MERGE tests > - > > Key: SPARK-44405 > URL: https://issues.apache.org/jira/browse/SPARK-44405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Anton Okolnychyi >Priority: Major > Labels: pull-request-available > > See [this|https://github.com/apache/spark/pull/41600#discussion_r1230014119] > discussion. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45651) Snapshots of some packages are not published any more
Enrico Minack created SPARK-45651: - Summary: Snapshots of some packages are not published any more Key: SPARK-45651 URL: https://issues.apache.org/jira/browse/SPARK-45651 Project: Spark Issue Type: Bug Components: Build Affects Versions: 4.0.0 Reporter: Enrico Minack Snapshots of some packages are no longer being published, e.g. spark-sql_2.13-4.0.0 has not been published since Sep 13th: https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.13/4.0.0-SNAPSHOT/ There have been some attempts to fix CI: SPARK-45535 SPARK-45536 The assumption is that memory consumption during the build exceeds the available memory of the GitHub host. The following could be attempted: - enable manual triggering of the {{publish_snapshots.yml}} workflow - enable some memory-use logging to prove that exceeded memory is the root cause - attempt to reduce the memory footprint and check the impact in the above logging - revert the memory-use logging -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45646) Remove hardcoding time variables prior to Hive 2.0
[ https://issues.apache.org/jira/browse/SPARK-45646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-45646: Assignee: Cheng Pan > Remove hardcoding time variables prior to Hive 2.0 > -- > > Key: SPARK-45646 > URL: https://issues.apache.org/jira/browse/SPARK-45646 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45646) Remove hardcoding time variables prior to Hive 2.0
[ https://issues.apache.org/jira/browse/SPARK-45646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-45646. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43506 [https://github.com/apache/spark/pull/43506] > Remove hardcoding time variables prior to Hive 2.0 > -- > > Key: SPARK-45646 > URL: https://issues.apache.org/jira/browse/SPARK-45646 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45650) fix dev/mima get scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-45650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangjiafu updated SPARK-45650: -- Description: Currently, when CI executes ./dev/mima, an incompatibility error with Scala 2.12 is generated. Sorry, I don't know how to fix it. [info] [launcher] getting org.scala-sbt sbt 1.9.3 (this may take some time)... [info] [launcher] getting Scala 2.12.18 (for sbt)... was: [info] [launcher] getting org.scala-sbt sbt 1.9.3 (this may take some time)... [info] [launcher] getting Scala 2.12.18 (for sbt)... > fix dev/mima get scala 2.12 > > > Key: SPARK-45650 > URL: https://issues.apache.org/jira/browse/SPARK-45650 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: tangjiafu >Priority: Major > > Currently, when CI executes ./dev/mima, an incompatibility error with > Scala 2.12 is generated. Sorry, I don't know how to fix it. > [info] [launcher] getting org.scala-sbt sbt 1.9.3 (this may take some > time)... > [info] [launcher] getting Scala 2.12.18 (for sbt)... -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45650) fix dev/mima get scala 2.12
tangjiafu created SPARK-45650: - Summary: fix dev/mima get scala 2.12 Key: SPARK-45650 URL: https://issues.apache.org/jira/browse/SPARK-45650 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 4.0.0 Reporter: tangjiafu [info] [launcher] getting org.scala-sbt sbt 1.9.3 (this may take some time)... [info] [launcher] getting Scala 2.12.18 (for sbt)... -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31836) input_file_name() gives wrong value following Python UDF usage
[ https://issues.apache.org/jira/browse/SPARK-31836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779048#comment-17779048 ] Rasmus Schøler Sørensen commented on SPARK-31836: - We have also encountered this bug. It is rather unfortunate that this bug has persisted for at least 3.5 years without resolution. We would like to do what we can to help resolve this issue. In the meantime, I guess we will mitigate this issue by first loading the raw file data into a "raw" table (using `input_file_name()` to populate a column with the source input file name), then processing the raw table and applying the UDF in a second step, outputting to a second table. For the record, I've included our observations regarding the extent of this bug below: h2. Findings: The issue occurs whenever a Python UDF is used, both when using `spark.read` and when using `spark.readStream`. We did not observe any cases where the read method would affect whether the bug manifested or not (i.e. `spark.read` vs `spark.readStream.text` vs 'cloudFiles' stream). In all cases, the bug only manifested when `input_file_name()` was used in conjunction with a UDF. The issue was observed in the following versions, regardless of whether the UDF was placed before or after `input_file_name()`: - Spark 3.5.0 (Databricks Runtime 14.1). - Spark 3.4.1 (Databricks Runtime 13.3). - Spark 3.3.2 (Databricks Runtime 12.2). - Spark 3.3.0 (Databricks Runtime 11.3). For the following versions, we only observed the issue when the UDF column was placed *before* `input_file_name()`: - Spark 3.2.1 (Databricks Runtime 10.4). - Spark 3.1.2 (Databricks Runtime 9.1). h2. Methodology: We tested four ways of loading data: # Using `spark.read`, without a Python UDF. # Using `spark.read`, with a Python UDF. # Using `spark.readStream`, without a Python UDF. # Using `spark.readStream`, with a Python UDF.
The following read methods and formats were tested: - Raw text-file read: `spark.read.format('text').load(...)` - Text-file stream: `spark.readStream.text(...)`. - 'cloudFiles' text stream: `spark.readStream.format('cloudFiles').option("cloudFiles.format", "text").load(...)` Input data consisted of a single folder with 2206 text files, each text file containing an average of 732 lines, with each line representing a single value (in this case, a file path), in total 1615051 rows/lines across all files. All reads were output to a delta table. The delta-table was subsequently analyzed for number of distinct values of the `input_file_name` column. In cases where the bug manifested, the number of distinct files was typically around 70-140 (with the expected/correct number being 2206). Everything was run inside a "Databricks" environment. Note that Databricks sometimes adds some "special sauce" to their version of Spark, although generally the "Databricks Spark" is very close to standard Apache Spark. The cluster used for all tests was a 4-core single-node cluster. > input_file_name() gives wrong value following Python UDF usage > -- > > Key: SPARK-31836 > URL: https://issues.apache.org/jira/browse/SPARK-31836 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.4.5, 3.0.0 >Reporter: Wesley Hildebrandt >Priority: Major > > I'm using PySpark for Spark 3.0.0 RC1 with Python 3.6.8. 
> The following commands demonstrate that the input_file_name() function > sometimes returns the wrong filename following usage of a Python UDF: > {code} > $ for i in `seq 5`; do echo $i > /tmp/test-file-$i; done > $ pyspark > >>> import pyspark.sql.functions as F > >>> spark.readStream.text('file:///tmp/test-file-*', > >>> wholetext=True).withColumn('file1', > >>> F.input_file_name()).withColumn('udf', F.udf(lambda > >>> x:x)('value')).withColumn('file2', > >>> F.input_file_name()).writeStream.trigger(once=True).foreachBatch(lambda > >>> df,_: df.select('file1','file2').show(truncate=False, > >>> vertical=True)).start().awaitTermination() > {code} > A few notes about this bug: > * It happens with many different files, so it's not related to the file > contents > * It also happens loading files from HDFS, so storage location is not a > factor > * It also happens using .csv() to read the files instead of .text(), so > input format is not a factor > * I have not been able to cause the error without using readStream, so it > seems to be related to streaming > * The bug also happens using spark-submit to send a job to my cluster > * I haven't tested an older version, but it's possible that Spark pulls > 24958 and 25321([https://github.com/apache/spark/pull/24958], > [https://github.com/apache/spark/pull/25321]) to fix issue 28153 >
[jira] [Updated] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"
[ https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adi Wehrli updated SPARK-45644: --- Description: A Spark job ran successfully with Spark 3.2.x and 3.3.x. But after upgrading to 3.4.1 (as well as 3.5.0), running the same job with the same data now always fails with: {code} scala.Some is not a valid external type for schema of array {code} The corresponding stacktrace is: {code} 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch worker for task 0.0 in stage 0.0 (TID 0)" java.lang.RuntimeException: scala.Some is not a valid external type for schema of array at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) ~[?:?] at org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165) ~[spark-sql_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380) ~[spark-sql_2.12-3.5.0.jar:3.5.0] at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) ~[scala-library-2.12.15.jar:?] at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) ~[scala-library-2.12.15.jar:?] 
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.scheduler.Task.run(Task.scala:141) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) [spark-core_2.12-3.5.0.jar:3.5.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:834) [?:?] 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch worker for task 1.0 in stage 0.0 (TID 1)" java.lang.RuntimeException: scala.Some is not a valid external type for schema of array at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown Source) ~[?:?] 
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) ~[?:?] at org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165) ~[spark-sql_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380) ~[spark-sql_2.12-3.5.0.jar:3.5.0] at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) ~[scala-library-2.12.15.jar:?] at
[jira] [Updated] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"
[ https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adi Wehrli updated SPARK-45644: --- Summary: After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array" (was: After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "is not valid external type") > After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException > "scala.Some is not a valid external type for schema of array" > -- > > Key: SPARK-45644 > URL: https://issues.apache.org/jira/browse/SPARK-45644 > Project: Spark > Issue Type: Question > Components: Spark Core, SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Adi Wehrli >Priority: Major > > A Spark job ran successfully with Spark 3.2.x and 3.3.x. > But after upgrading to 3.4.1 (as well as with 3.5.0) the following always > occurs now: > {code} > scala.Some is not a valid external type for schema of array > {code} > The corresponding stacktrace is: > {code} > 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor > msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch > worker for task 0.0 in stage 0.0 (TID 0)" > java.lang.RuntimeException: scala.Some is not a valid external type for > schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) ~[?:?] 
> at > org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165) > ~[spark-sql_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380) > ~[spark-sql_2.12-3.5.0.jar:3.5.0] > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > ~[scala-library-2.12.15.jar:?] > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > ~[scala-library-2.12.15.jar:?] > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at org.apache.spark.scheduler.Task.run(Task.scala:141) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) > ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) > ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) > [spark-core_2.12-3.5.0.jar:3.5.0] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] 
> at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor > msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch > worker for task 1.0 in stage 0.0 (TID 1)" > java.lang.RuntimeException: scala.Some is not a valid external type for > schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown > Source) ~[?:?] > at >
[jira] [Updated] (SPARK-45649) Unify the prepare framework for `OffsetWindowFunctionFrame`
[ https://issues.apache.org/jira/browse/SPARK-45649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45649: --- Labels: pull-request-available (was: ) > Unify the prepare framework for `OffsetWindowFunctionFrame` > --- > > Key: SPARK-45649 > URL: https://issues.apache.org/jira/browse/SPARK-45649 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > > Currently, the `prepare` implementations of all the > `OffsetWindowFunctionFrame` subclasses have the same code logic, shown below. > ``` > override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { > if (offset > rows.length) { > fillDefaultValue(EmptyRow) > } else { > resetStates(rows) > if (ignoreNulls) { > ... > } else { > ... > } > } > } > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45649) Unify the prepare framework for `OffsetWindowFunctionFrame`
[ https://issues.apache.org/jira/browse/SPARK-45649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng updated SPARK-45649: --- Summary: Unify the prepare framework for `OffsetWindowFunctionFrame` (was: Unified the prepare framework for `OffsetWindowFunctionFrame`) > Unify the prepare framework for `OffsetWindowFunctionFrame` > --- > > Key: SPARK-45649 > URL: https://issues.apache.org/jira/browse/SPARK-45649 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > > Currently, the `prepare` implementations of all the > `OffsetWindowFunctionFrame` subclasses have the same code logic, shown below. > ``` > override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { > if (offset > rows.length) { > fillDefaultValue(EmptyRow) > } else { > resetStates(rows) > if (ignoreNulls) { > ... > } else { > ... > } > } > } > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45649) Unified the prepare framework for `OffsetWindowFunctionFrame`
Jiaan Geng created SPARK-45649: -- Summary: Unified the prepare framework for `OffsetWindowFunctionFrame` Key: SPARK-45649 URL: https://issues.apache.org/jira/browse/SPARK-45649 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Jiaan Geng Assignee: Jiaan Geng Currently, the `prepare` implementations of all the `OffsetWindowFunctionFrame` subclasses have the same code logic, shown below. ``` override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { if (offset > rows.length) { fillDefaultValue(EmptyRow) } else { resetStates(rows) if (ignoreNulls) { ... } else { ... } } } ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
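The duplicated `prepare` logic described above is a classic candidate for the template-method pattern: hoist the shared skeleton into the base class and let each frame supply only the two null-handling branches. This is a hedged Python sketch of that shape; the class and hook names are illustrative, not Spark's real internals.

```python
# Illustrative template-method sketch of the proposed unification
# (names are stand-ins for Spark's OffsetWindowFunctionFrame hierarchy).

class OffsetWindowFunctionFrame:
    def __init__(self, offset, ignore_nulls):
        self.offset = offset
        self.ignore_nulls = ignore_nulls
        self.state = None

    def prepare(self, rows):
        # Shared skeleton, previously duplicated in every subclass.
        if self.offset > len(rows):
            self._fill_default_value()
        else:
            self._reset_states(rows)
            if self.ignore_nulls:
                self._prepare_ignore_nulls(rows)
            else:
                self._prepare_respect_nulls(rows)

    def _fill_default_value(self):
        self.state = "default"

    def _reset_states(self, rows):
        self.state = "reset"

    # Subclass hooks: only these two branches differ per frame.
    def _prepare_ignore_nulls(self, rows):
        raise NotImplementedError

    def _prepare_respect_nulls(self, rows):
        raise NotImplementedError

class LeadFrame(OffsetWindowFunctionFrame):
    def _prepare_ignore_nulls(self, rows):
        self.state += "/ignore-nulls"

    def _prepare_respect_nulls(self, rows):
        self.state += "/respect-nulls"
```

Each concrete frame then shrinks to just its two branch bodies, and the offset/reset bookkeeping lives in one place.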
[jira] [Created] (SPARK-45648) Add sql/api and common/utils to modules.py
Ruifeng Zheng created SPARK-45648: - Summary: Add sql/api and common/utils to modules.py Key: SPARK-45648 URL: https://issues.apache.org/jira/browse/SPARK-45648 Project: Spark Issue Type: Test Components: Tests Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45647) Spark Connect API to propagate per request context
Juliusz Sompolski created SPARK-45647: - Summary: Spark Connect API to propagate per request context Key: SPARK-45647 URL: https://issues.apache.org/jira/browse/SPARK-45647 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: Juliusz Sompolski There is an extension point to pass an arbitrary proto extension in the Spark Connect UserContext, but there is no API to do this in the client. Add a SparkSession API to attach extra protos that will be sent with all requests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45561) Convert TINYINT catalyst properly in MySQL Dialect
[ https://issues.apache.org/jira/browse/SPARK-45561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-45561. -- Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 43390 [https://github.com/apache/spark/pull/43390] > Convert TINYINT catalyst properly in MySQL Dialect > -- > > Key: SPARK-45561 > URL: https://issues.apache.org/jira/browse/SPARK-45561 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Michael Zhang >Assignee: Michael Zhang >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.1, 4.0.0 > > > The MySQL dialect currently converts the Catalyst type `TINYINT` to > BYTE incorrectly. However, MySQL doesn't have a BYTE type. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
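The fix above concerns how a JDBC dialect maps Catalyst types to vendor column types (in Spark this likely lives in the MySQL dialect's `getJDBCType` override). A minimal illustrative mapping, with plain strings standing in for Catalyst `DataType`s, shows why `ByteType` must map to `TINYINT` rather than a nonexistent `BYTE`:

```python
# Illustrative type mapping only, not Spark's real JdbcDialect API:
# map Catalyst integral types to MySQL column types. MySQL has no BYTE
# type, so ByteType must become TINYINT (a 1-byte signed integer).

def catalyst_to_mysql(catalyst_type: str) -> str:
    mapping = {
        "ByteType": "TINYINT",
        "ShortType": "SMALLINT",
        "IntegerType": "INT",
        "LongType": "BIGINT",
    }
    if catalyst_type not in mapping:
        raise ValueError(f"unsupported type: {catalyst_type}")
    return mapping[catalyst_type]
```

Emitting `BYTE` for `ByteType` would produce DDL that MySQL rejects, which is the bug this ticket resolves.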
[jira] [Assigned] (SPARK-45561) Convert TINYINT catalyst properly in MySQL Dialect
[ https://issues.apache.org/jira/browse/SPARK-45561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-45561: Assignee: Michael Zhang > Convert TINYINT catalyst properly in MySQL Dialect > -- > > Key: SPARK-45561 > URL: https://issues.apache.org/jira/browse/SPARK-45561 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Michael Zhang >Assignee: Michael Zhang >Priority: Minor > Labels: pull-request-available > > The MySQL dialect currently converts the Catalyst type `TINYINT` to > BYTE incorrectly. However, MySQL doesn't have a BYTE type. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26052) Spark should output a _SUCCESS file for every partition correctly written
[ https://issues.apache.org/jira/browse/SPARK-26052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-26052: --- Labels: bulk-closed pull-request-available (was: bulk-closed) > Spark should output a _SUCCESS file for every partition correctly written > - > > Key: SPARK-26052 > URL: https://issues.apache.org/jira/browse/SPARK-26052 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Affects Versions: 2.3.0 >Reporter: Matt Matolcsi >Priority: Minor > Labels: bulk-closed, pull-request-available > > When writing a set of partitioned Parquet files to HDFS using > dataframe.write.parquet(), a _SUCCESS file is written to hdfs://path/to/table > after successful completion, though the actual Parquet files will end up in > hdfs://path/to/table/partition_key1=val1/partition_key2=val2/ > If partitions are written out one at a time (e.g., an hourly ETL), the > _SUCCESS file is overwritten by each subsequent run and information on what > partitions were correctly written is lost. > I would like to be able to keep track of what partitions were successfully > written in HDFS. I think this could be done by writing the _SUCCESS files to > the same partition directories where the Parquet files reside, i.e., > hdfs://path/to/table/partition_key1=val1/partition_key2=val2/ > Since https://issues.apache.org/jira/browse/SPARK-13207 has been resolved, I > don't think this should break partition discovery. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
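The per-partition `_SUCCESS` behavior requested above can be approximated today from user code: after each incremental write, drop a marker file into every partition directory that run produced. This is a hedged user-side sketch (not Spark's committer behavior); the helper name and path layout are illustrative.

```python
# User-side workaround sketch: create <table>/<partition>/_SUCCESS for each
# partition directory written by this run, so later runs don't clobber a
# single table-level marker. Names and layout are illustrative.
from pathlib import Path

def mark_partitions(table_root: str, partition_dirs) -> list:
    markers = []
    for rel in partition_dirs:
        # e.g. rel = "partition_key1=val1/partition_key2=val2"
        part_dir = Path(table_root) / rel
        part_dir.mkdir(parents=True, exist_ok=True)
        marker = part_dir / "_SUCCESS"
        marker.touch()  # empty marker file, like Hadoop's _SUCCESS
        markers.append(marker)
    return markers
```

On HDFS or object storage the same idea applies with the corresponding filesystem client instead of `pathlib`.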
[jira] [Assigned] (SPARK-45575) support time travel options for df read API
[ https://issues.apache.org/jira/browse/SPARK-45575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45575: -- Assignee: Apache Spark > support time travel options for df read API > --- > > Key: SPARK-45575 > URL: https://issues.apache.org/jira/browse/SPARK-45575 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45575) support time travel options for df read API
[ https://issues.apache.org/jira/browse/SPARK-45575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45575: -- Assignee: (was: Apache Spark) > support time travel options for df read API > --- > > Key: SPARK-45575 > URL: https://issues.apache.org/jira/browse/SPARK-45575 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45646) Remove hardcoding time variables prior to Hive 2.0
[ https://issues.apache.org/jira/browse/SPARK-45646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45646: -- Assignee: (was: Apache Spark) > Remove hardcoding time variables prior to Hive 2.0 > -- > > Key: SPARK-45646 > URL: https://issues.apache.org/jira/browse/SPARK-45646 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45646) Remove hardcoding time variables prior to Hive 2.0
[ https://issues.apache.org/jira/browse/SPARK-45646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45646: -- Assignee: Apache Spark > Remove hardcoding time variables prior to Hive 2.0 > -- > > Key: SPARK-45646 > URL: https://issues.apache.org/jira/browse/SPARK-45646 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45575) support time travel options for df read API
[ https://issues.apache.org/jira/browse/SPARK-45575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45575: -- Assignee: Apache Spark > support time travel options for df read API > --- > > Key: SPARK-45575 > URL: https://issues.apache.org/jira/browse/SPARK-45575 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45575) support time travel options for df read API
[ https://issues.apache.org/jira/browse/SPARK-45575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45575: -- Assignee: (was: Apache Spark) > support time travel options for df read API > --- > > Key: SPARK-45575 > URL: https://issues.apache.org/jira/browse/SPARK-45575 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42746) Add the LISTAGG() aggregate function
[ https://issues.apache.org/jira/browse/SPARK-42746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-42746: -- Assignee: Apache Spark > Add the LISTAGG() aggregate function > > > Key: SPARK-42746 > URL: https://issues.apache.org/jira/browse/SPARK-42746 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > {{listagg()}} is a common and useful aggregation function to concatenate > string values in a column, optionally in a certain order. The systems below > have already supported such a function: > * Oracle: > [https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030] > * Snowflake: [https://docs.snowflake.com/en/sql-reference/functions/listagg] > * Amazon Redshift: > [https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html] > * Google BigQuery: > [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg] > Need to introduce this new aggregate in Spark, both as a regular aggregate > and as a window function. > Proposed syntax: > {code:sql} > LISTAGG( [ DISTINCT ] <expr> [, <delimiter> ] ) [ WITHIN GROUP ( > <orderby_clause> ) ] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
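The proposed semantics can be illustrated outside Spark. The following plain-Python helper is a minimal sketch of the behavior described above (it is not Spark's implementation; the function name and parameters are illustrative only):

```python
def listagg(values, delimiter=",", distinct=False, order_key=None):
    """Sketch of LISTAGG([DISTINCT] expr[, delimiter]) [WITHIN GROUP (ORDER BY ...)]."""
    # SQL aggregate functions skip NULL inputs.
    vals = [v for v in values if v is not None]
    if distinct:
        # Keep the first occurrence of each value, preserving encounter order.
        seen, uniq = set(), []
        for v in vals:
            if v not in seen:
                seen.add(v)
                uniq.append(v)
        vals = uniq
    if order_key is not None:
        # WITHIN GROUP (ORDER BY ...) sorts the values before concatenation.
        vals = sorted(vals, key=order_key)
    return delimiter.join(str(v) for v in vals)

print(listagg(["b", None, "a", "b"], delimiter="|", distinct=True, order_key=str))  # a|b
```

As a regular aggregate this would run once per group; as a window function it would run once per window partition.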
[jira] [Updated] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "is not valid external type"
[ https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adi Wehrli updated SPARK-45644: --- Description: A Spark job ran successfully with Spark 3.2.x and 3.3.x. But after upgrading to 3.4.1 (as well as with 3.5.0) the following always occurs now: {code} scala.Some is not a valid external type for schema of array {code} The corresponding stacktrace is: {code} 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch worker for task 0.0 in stage 0.0 (TID 0)" java.lang.RuntimeException: scala.Some is not a valid external type for schema of array at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) ~[?:?] at org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165) ~[spark-sql_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380) ~[spark-sql_2.12-3.5.0.jar:3.5.0] at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) ~[scala-library-2.12.15.jar:?] at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) ~[scala-library-2.12.15.jar:?] 
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.scheduler.Task.run(Task.scala:141) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) ~[spark-core_2.12-3.5.0.jar:3.5.0] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) [spark-core_2.12-3.5.0.jar:3.5.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:834) [?:?] 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch worker for task 1.0 in stage 0.0 (TID 1)" java.lang.RuntimeException: scala.Some is not a valid external type for schema of array at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown Source) ~[?:?] 
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) ~[?:?] at org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165) ~[spark-sql_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380) ~[spark-sql_2.12-3.5.0.jar:3.5.0] at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) ~[scala-library-2.12.15.jar:?] at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
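The error above indicates the generated projection received a `scala.Some` wrapper where the schema expected a plain sequence. A rough non-Spark analogy (illustrative only; the class and function names below are hypothetical, not Spark internals):

```python
def encode_array(value):
    """Toy serializer that, like a projection for an array-typed schema,
    accepts only a plain sequence as the external value."""
    if not isinstance(value, (list, tuple)):
        raise RuntimeError(
            f"{type(value).__name__} is not a valid external type for schema of array")
    return list(value)

class Some:
    """Stand-in for scala.Some: a wrapper around the actual value."""
    def __init__(self, value):
        self.value = value

try:
    encode_array(Some([1, 2]))       # the Option wrapper leaks through -> fails
except RuntimeError as e:
    print(e)

print(encode_array(Some([1, 2]).value))  # unwrapping first succeeds
```

In the real issue the unwrapping is the encoder's job, so a regression between 3.3.x and 3.4.x in how Option-typed fields are serialized is the suspected cause.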
[jira] [Resolved] (SPARK-44752) XML: Update Spark Docs
[ https://issues.apache.org/jira/browse/SPARK-44752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44752. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43350 [https://github.com/apache/spark/pull/43350] > XML: Update Spark Docs > -- > > Key: SPARK-44752 > URL: https://issues.apache.org/jira/browse/SPARK-44752 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Assignee: tangjiafu >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [https://spark.apache.org/docs/latest/sql-data-sources.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44752) XML: Update Spark Docs
[ https://issues.apache.org/jira/browse/SPARK-44752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44752: Assignee: tangjiafu > XML: Update Spark Docs > -- > > Key: SPARK-44752 > URL: https://issues.apache.org/jira/browse/SPARK-44752 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Assignee: tangjiafu >Priority: Major > Labels: pull-request-available > > [https://spark.apache.org/docs/latest/sql-data-sources.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45641) Display the start time of an app following the Total Uptime segment on AllJobsPage
[ https://issues.apache.org/jira/browse/SPARK-45641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45641. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43495 [https://github.com/apache/spark/pull/43495] > Display the start time of an app following the Total Uptime segment on > AllJobsPage > -- > > Key: SPARK-45641 > URL: https://issues.apache.org/jira/browse/SPARK-45641 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45641) Display the start time of an app following the Total Uptime segment on AllJobsPage
[ https://issues.apache.org/jira/browse/SPARK-45641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45641: - Assignee: Kent Yao > Display the start time of an app following the Total Uptime segment on > AllJobsPage > -- > > Key: SPARK-45641 > URL: https://issues.apache.org/jira/browse/SPARK-45641 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45646) Remove hardcoding time variables prior to Hive 2.0
[ https://issues.apache.org/jira/browse/SPARK-45646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45646: --- Labels: pull-request-available (was: ) > Remove hardcoding time variables prior to Hive 2.0 > -- > > Key: SPARK-45646 > URL: https://issues.apache.org/jira/browse/SPARK-45646 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45626) Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE
[ https://issues.apache.org/jira/browse/SPARK-45626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-45626: Assignee: BingKun Pan > Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE > - > > Key: SPARK-45626 > URL: https://issues.apache.org/jira/browse/SPARK-45626 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45626) Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE
[ https://issues.apache.org/jira/browse/SPARK-45626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-45626. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43479 [https://github.com/apache/spark/pull/43479] > Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE > - > > Key: SPARK-45626 > URL: https://issues.apache.org/jira/browse/SPARK-45626 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45646) Remove hardcoding time variables prior to Hive 2.0
Cheng Pan created SPARK-45646: - Summary: Remove hardcoding time variables prior to Hive 2.0 Key: SPARK-45646 URL: https://issues.apache.org/jira/browse/SPARK-45646 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45626) Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE
[ https://issues.apache.org/jira/browse/SPARK-45626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-45626: Summary: Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE (was: Fix variable name of error-class & assign names to the error class _LEGACY_ERROR_TEMP_1055) > Convert _LEGACY_ERROR_TEMP_1055 to REQUIRES_SINGLE_PART_NAMESPACE > - > > Key: SPARK-45626 > URL: https://issues.apache.org/jira/browse/SPARK-45626 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45630) Replace `s.c.mutable.MapOps#retain` with `s.c.mutable.MapOps#filterInPlace`
[ https://issues.apache.org/jira/browse/SPARK-45630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-45630: Assignee: Yang Jie > Replace `s.c.mutable.MapOps#retain` with `s.c.mutable.MapOps#filterInPlace` > --- > > Key: SPARK-45630 > URL: https://issues.apache.org/jira/browse/SPARK-45630 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL, YARN >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45430) FramelessOffsetWindowFunctionFrame fails when ignore nulls and offset > # of rows
[ https://issues.apache.org/jira/browse/SPARK-45430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-45430. - Fix Version/s: 3.3.4 3.5.1 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 43236 [https://github.com/apache/spark/pull/43236] > FramelessOffsetWindowFunctionFrame fails when ignore nulls and offset > # of > rows > -- > > Key: SPARK-45430 > URL: https://issues.apache.org/jira/browse/SPARK-45430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Vitalii Li >Assignee: Vitalii Li >Priority: Major > Labels: pull-request-available > Fix For: 3.3.4, 3.5.1, 4.0.0, 3.4.2 > > > Failure when function that utilized `FramelessOffsetWindowFunctionFrame` is > used with `ignoreNulls = true` and `offset > rowCount`. > e.g. > ``` > select x, lead(x, 5) IGNORE NULLS over (order by x) from (select > explode(sequence(1, 3)) x) > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
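The failing case can be emulated outside Spark. This hypothetical helper shows the expected result of `lead(x, N) IGNORE NULLS` when the offset exceeds the row count: every output should be NULL rather than the query failing.

```python
def lead_ignore_nulls(rows, offset):
    """For each row, return the offset-th following non-null value,
    or None when fewer than `offset` non-null values remain."""
    out = []
    for i in range(len(rows)):
        remaining = [v for v in rows[i + 1:] if v is not None]
        out.append(remaining[offset - 1] if len(remaining) >= offset else None)
    return out

# offset (5) > number of rows (3): every result is None, no failure
print(lead_ignore_nulls([1, 2, 3], 5))  # [None, None, None]
```

This mirrors the Jira example `select x, lead(x, 5) IGNORE NULLS over (order by x) from (select explode(sequence(1, 3)) x)`, which should return three NULLs.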
[jira] [Assigned] (SPARK-45430) FramelessOffsetWindowFunctionFrame fails when ignore nulls and offset > # of rows
[ https://issues.apache.org/jira/browse/SPARK-45430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-45430: --- Assignee: Vitalii Li > FramelessOffsetWindowFunctionFrame fails when ignore nulls and offset > # of > rows > -- > > Key: SPARK-45430 > URL: https://issues.apache.org/jira/browse/SPARK-45430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Vitalii Li >Assignee: Vitalii Li >Priority: Major > Labels: pull-request-available > > Failure when function that utilized `FramelessOffsetWindowFunctionFrame` is > used with `ignoreNulls = true` and `offset > rowCount`. > e.g. > ``` > select x, lead(x, 5) IGNORE NULLS over (order by x) from (select > explode(sequence(1, 3)) x) > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45630) Replace `s.c.mutable.MapOps#retain` with `s.c.mutable.MapOps#filterInPlace`
[ https://issues.apache.org/jira/browse/SPARK-45630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-45630. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43482 [https://github.com/apache/spark/pull/43482] > Replace `s.c.mutable.MapOps#retain` with `s.c.mutable.MapOps#filterInPlace` > --- > > Key: SPARK-45630 > URL: https://issues.apache.org/jira/browse/SPARK-45630 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL, YARN >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45645) Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated`
[ https://issues.apache.org/jira/browse/SPARK-45645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-45645: - Parent: (was: SPARK-45314) Issue Type: Improvement (was: Sub-task) > Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated` > -- > > Key: SPARK-45645 > URL: https://issues.apache.org/jira/browse/SPARK-45645 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44032) Remove threeten-extra exclusion in enforceBytecodeVersion rule
[ https://issues.apache.org/jira/browse/SPARK-44032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44032: - Assignee: Dongjoon Hyun > Remove threeten-extra exclusion in enforceBytecodeVersion rule > -- > > Key: SPARK-44032 > URL: https://issues.apache.org/jira/browse/SPARK-44032 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Bowen Liang >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > > We can remove `threeten-extra` library exclusion rule because Apache Spark > 4.0.0's minimum Java is 17. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44032) Remove threeten-extra exclusion in enforceBytecodeVersion rule
[ https://issues.apache.org/jira/browse/SPARK-44032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44032. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43504 [https://github.com/apache/spark/pull/43504] > Remove threeten-extra exclusion in enforceBytecodeVersion rule > -- > > Key: SPARK-44032 > URL: https://issues.apache.org/jira/browse/SPARK-44032 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Bowen Liang >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > We can remove `threeten-extra` library exclusion rule because Apache Spark > 4.0.0's minimum Java is 17. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
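For context, the exclusion being removed resembles the following `enforceBytecodeVersion` configuration from `org.codehaus.mojo:extra-enforcer-rules` (a hedged sketch; the exact coordinates and version value in Spark's `pom.xml` may differ):

```xml
<!-- enforceBytecodeVersion rejects dependencies compiled for a newer JDK
     than maxJdkVersion, unless they are explicitly excluded. -->
<enforceBytecodeVersion>
  <maxJdkVersion>17</maxJdkVersion>
  <excludes>
    <!-- exclusion no longer needed once the minimum Java version is 17 -->
    <exclude>org.threeten:threeten-extra</exclude>
  </excludes>
</enforceBytecodeVersion>
```

With the build's minimum Java raised to 17, the library's bytecode no longer exceeds the limit, so the exclusion can simply be deleted.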
[jira] [Updated] (SPARK-45642) Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated`
[ https://issues.apache.org/jira/browse/SPARK-45642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45642: --- Labels: pull-request-available (was: ) > Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated` > -- > > Key: SPARK-45642 > URL: https://issues.apache.org/jira/browse/SPARK-45642 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45645) Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated`
[ https://issues.apache.org/jira/browse/SPARK-45645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan resolved SPARK-45645. - Resolution: Duplicate > Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated` > -- > > Key: SPARK-45645 > URL: https://issues.apache.org/jira/browse/SPARK-45645 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45645) Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated`
BingKun Pan created SPARK-45645: --- Summary: Fix `FileSystem.isFile & FileSystem.isDirectory is deprecated` Key: SPARK-45645 URL: https://issues.apache.org/jira/browse/SPARK-45645 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45632) Table cache should avoid unnecessary ColumnarToRow when enable AQE
[ https://issues.apache.org/jira/browse/SPARK-45632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-45632: - Assignee: XiDuo You > Table cache should avoid unnecessary ColumnarToRow when enable AQE > -- > > Key: SPARK-45632 > URL: https://issues.apache.org/jira/browse/SPARK-45632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > > If the cache serializer supports columnar input, then we do not need a > ColumnarToRow before cache data. This pr improves the optimization with AQE > enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45632) Table cache should avoid unnecessary ColumnarToRow when enable AQE
[ https://issues.apache.org/jira/browse/SPARK-45632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-45632. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43484 [https://github.com/apache/spark/pull/43484] > Table cache should avoid unnecessary ColumnarToRow when enable AQE > -- > > Key: SPARK-45632 > URL: https://issues.apache.org/jira/browse/SPARK-45632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > If the cache serializer supports columnar input, then we do not need a > ColumnarToRow before cache data. This pr improves the optimization with AQE > enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
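The optimization can be sketched schematically in plain Python (hypothetical names, not Spark's InMemoryRelation code): when the cache serializer can consume columnar batches directly, the planner can skip inserting a ColumnarToRow conversion before caching.

```python
def cache_plan(child_output_is_columnar, serializer_supports_columnar):
    """Toy planning decision mirroring the optimization: cache columnar
    batches directly instead of converting them to rows first."""
    if child_output_is_columnar and serializer_supports_columnar:
        return ["CacheColumnarBatches"]        # no conversion needed
    if child_output_is_columnar:
        return ["ColumnarToRow", "CacheRows"]  # fall back to a row cache
    return ["CacheRows"]

print(cache_plan(True, True))   # ['CacheColumnarBatches']
```

The fix in the PR applies this decision in the AQE path as well, where the cached child plan is re-planned adaptively.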