[jira] [Updated] (SPARK-46769) Fix inferring of TIMESTAMP_NTZ in CSV/JSON
[ https://issues.apache.org/jira/browse/SPARK-46769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46769: --- Labels: pull-request-available (was: ) > Fix inferring of TIMESTAMP_NTZ in CSV/JSON > -- > > Key: SPARK-46769 > URL: https://issues.apache.org/jira/browse/SPARK-46769 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > > After the PR https://github.com/apache/spark/pull/43243, the TIMESTAMP_NTZ > type inference in the CSV/JSON datasources got two new guards, which mean > TIMESTAMP_NTZ should be inferred only if: > 1. the SQL config `spark.sql.legacy.timeParserPolicy` is set to `LEGACY`, or > 2. `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`; > otherwise CSV/JSON should try to infer `TIMESTAMP_LTZ`. > Both guards are unnecessary because: > 1. when `spark.sql.legacy.timeParserPolicy` is `LEGACY`, that only means Spark > should use a legacy (pre-Java 8) parser: `FastDateFormat` or `SimpleDateFormat`. > Both parsers are applicable for parsing `TIMESTAMP_NTZ`. > 2. when `spark.sql.timestampType` is set to `TIMESTAMP_LTZ`, it doesn't mean > that we should skip inferring `TIMESTAMP_NTZ` types in CSV/JSON and try > to parse a timestamp string value without a time zone, like > `2024-01-19T09:10:11.123`, using an LTZ format **with a timezone**, like > `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`. _The latter cannot match any NTZ values._ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46769) Fix inferring of TIMESTAMP_NTZ in CSV/JSON
Max Gekk created SPARK-46769: Summary: Fix inferring of TIMESTAMP_NTZ in CSV/JSON Key: SPARK-46769 URL: https://issues.apache.org/jira/browse/SPARK-46769 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Max Gekk Assignee: Max Gekk After the PR https://github.com/apache/spark/pull/43243, the TIMESTAMP_NTZ type inference in the CSV/JSON datasources got two new guards, which mean TIMESTAMP_NTZ should be inferred only if: 1. the SQL config `spark.sql.legacy.timeParserPolicy` is set to `LEGACY`, or 2. `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`; otherwise CSV/JSON should try to infer `TIMESTAMP_LTZ`. Both guards are unnecessary because: 1. when `spark.sql.legacy.timeParserPolicy` is `LEGACY`, that only means Spark should use a legacy (pre-Java 8) parser: `FastDateFormat` or `SimpleDateFormat`. Both parsers are applicable for parsing `TIMESTAMP_NTZ`. 2. when `spark.sql.timestampType` is set to `TIMESTAMP_LTZ`, it doesn't mean that we should skip inferring `TIMESTAMP_NTZ` types in CSV/JSON and try to parse a timestamp string value without a time zone, like `2024-01-19T09:10:11.123`, using an LTZ format **with a timezone**, like `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`. _The latter cannot match any NTZ values._ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
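For context on the ticket above, the pattern-level mismatch is easy to demonstrate with Java's java.time API. This is a minimal sketch of the behavior, not the actual code path Spark uses for CSV/JSON schema inference:
{code:java}
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// An NTZ-style value, as in the description above.
val ntzValue = "2024-01-19T09:10:11.123"

// An NTZ pattern (no zone) parses the value fine.
val ntzFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS")
println(Try(LocalDateTime.parse(ntzValue, ntzFmt)))  // Success(2024-01-19T09:10:11.123)

// An LTZ pattern that requires a zone offset (XXX) cannot match a zone-less string.
val ltzFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
println(Try(LocalDateTime.parse(ntzValue, ltzFmt)))  // Failure(DateTimeParseException)
{code}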
[jira] [Resolved] (SPARK-46765) make `shuffle` specify the datatype of `seed`
[ https://issues.apache.org/jira/browse/SPARK-46765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-46765. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44793 [https://github.com/apache/spark/pull/44793] > make `shuffle` specify the datatype of `seed` > - > > Key: SPARK-46765 > URL: https://issues.apache.org/jira/browse/SPARK-46765 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46765) make `shuffle` specify the datatype of `seed`
[ https://issues.apache.org/jira/browse/SPARK-46765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-46765: - Assignee: Ruifeng Zheng > make `shuffle` specify the datatype of `seed` > - > > Key: SPARK-46765 > URL: https://issues.apache.org/jira/browse/SPARK-46765 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46768) Upgrade the Guava version used by the connect module to 33.0-jre
[ https://issues.apache.org/jira/browse/SPARK-46768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46768: --- Labels: pull-request-available (was: ) > Upgrade the Guava version used by the connect module to 33.0-jre > > > Key: SPARK-46768 > URL: https://issues.apache.org/jira/browse/SPARK-46768 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46768) Upgrade the Guava version used by the connect module to 33.0-jre
Yang Jie created SPARK-46768: Summary: Upgrade the Guava version used by the connect module to 33.0-jre Key: SPARK-46768 URL: https://issues.apache.org/jira/browse/SPARK-46768 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46767) Refine docstring of `abs/acos/acosh`
[ https://issues.apache.org/jira/browse/SPARK-46767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46767: --- Labels: pull-request-available (was: ) > Refine docstring of `abs/acos/acosh` > > > Key: SPARK-46767 > URL: https://issues.apache.org/jira/browse/SPARK-46767 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46767) Refine docstring of `abs/acos/acosh`
Yang Jie created SPARK-46767: Summary: Refine docstring of `abs/acos/acosh` Key: SPARK-46767 URL: https://issues.apache.org/jira/browse/SPARK-46767 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46765) make `shuffle` specify the datatype of `seed`
[ https://issues.apache.org/jira/browse/SPARK-46765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46765: --- Labels: pull-request-available (was: ) > make `shuffle` specify the datatype of `seed` > - > > Key: SPARK-46765 > URL: https://issues.apache.org/jira/browse/SPARK-46765 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46765) make `shuffle` specify the datatype of `seed`
[ https://issues.apache.org/jira/browse/SPARK-46765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-46765: -- Summary: make `shuffle` specify the datatype of `seed` (was: Support upcasting for unregistered functions) > make `shuffle` specify the datatype of `seed` > - > > Key: SPARK-46765 > URL: https://issues.apache.org/jira/browse/SPARK-46765 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46765) Support upcasting for unregistered functions
[ https://issues.apache.org/jira/browse/SPARK-46765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-46765: -- Priority: Major (was: Minor) > Support upcasting for unregistered functions > > > Key: SPARK-46765 > URL: https://issues.apache.org/jira/browse/SPARK-46765 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46765) Support upcasting for unregistered functions
[ https://issues.apache.org/jira/browse/SPARK-46765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-46765: -- Summary: Support upcasting for unregistered functions (was: make `shuffle` specify the datatype of `seed`) > Support upcasting for unregistered functions > > > Key: SPARK-46765 > URL: https://issues.apache.org/jira/browse/SPARK-46765 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46766) ZSTD Buffer Pool Support For AVRO datasource
[ https://issues.apache.org/jira/browse/SPARK-46766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46766: --- Labels: pull-request-available (was: ) > ZSTD Buffer Pool Support For AVRO datasource > > > Key: SPARK-46766 > URL: https://issues.apache.org/jira/browse/SPARK-46766 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46766) ZSTD Buffer Pool Support For AVRO datasource
Kent Yao created SPARK-46766: Summary: ZSTD Buffer Pool Support For AVRO datasource Key: SPARK-46766 URL: https://issues.apache.org/jira/browse/SPARK-46766 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan
[ https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-46676: Assignee: Jungtaek Lim > dropDuplicatesWithinWatermark throws error on canonicalizing plan > - > > Key: SPARK-46676 > URL: https://issues.apache.org/jira/browse/SPARK-46676 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.5.0, 4.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Labels: pull-request-available > > Simply said, this test code fails: > {code:java} > test("SPARK-X: canonicalization of > StreamingDeduplicateWithinWatermarkExec should work") { > withTempDir { checkpoint => > val dedupeInputData = MemoryStream[(String, Int)] > val dedupe = dedupeInputData.toDS() > .withColumn("eventTime", timestamp_seconds($"_2")) > .withWatermark("eventTime", "10 second") > .dropDuplicatesWithinWatermark("_1") > .select($"_1", $"eventTime".cast("long").as[Long]) > testStream(dedupe, Append)( > StartStream(checkpointLocation = checkpoint.getCanonicalPath), > AddData(dedupeInputData, "a" -> 1), > CheckNewAnswer("a" -> 1), > Execute { q => > // This threw out error! > q.lastExecution.executedPlan.canonicalized > } > ) > } > } {code} > with below error: > {code:java} > [info] - SPARK-X: canonicalization of > StreamingDeduplicateWithinWatermarkExec should work *** FAILED *** (1 second, > 237 milliseconds) > [info] Assert on query failed: Execute: None.get > [info] scala.None$.get(Option.scala:627) > [info] scala.None$.get(Option.scala:626) > [info] > org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.<init>(statefulOperators.scala:1101) > [info] > org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.copy(statefulOperators.scala:1092) > [info] > org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1148) > [info] > org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1087) > [info] > org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1210) > [info] > org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1208) > [info] > org.apache.spark.sql.execution.streaming.BaseStreamingDeduplicateExec.withNewChildrenInternal(statefulOperators.scala:949) > [info] > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:323) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan
[ https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-46676. -- Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 44688 [https://github.com/apache/spark/pull/44688] > dropDuplicatesWithinWatermark throws error on canonicalizing plan > - > > Key: SPARK-46676 > URL: https://issues.apache.org/jira/browse/SPARK-46676 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.5.0, 4.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Labels: pull-request-available > Fix For: 3.5.1, 4.0.0 > > > Simply said, this test code fails: > {code:java} > test("SPARK-X: canonicalization of > StreamingDeduplicateWithinWatermarkExec should work") { > withTempDir { checkpoint => > val dedupeInputData = MemoryStream[(String, Int)] > val dedupe = dedupeInputData.toDS() > .withColumn("eventTime", timestamp_seconds($"_2")) > .withWatermark("eventTime", "10 second") > .dropDuplicatesWithinWatermark("_1") > .select($"_1", $"eventTime".cast("long").as[Long]) > testStream(dedupe, Append)( > StartStream(checkpointLocation = checkpoint.getCanonicalPath), > AddData(dedupeInputData, "a" -> 1), > CheckNewAnswer("a" -> 1), > Execute { q => > // This threw out error! > q.lastExecution.executedPlan.canonicalized > } > ) > } > } {code} > with below error: > {code:java} > [info] - SPARK-X: canonicalization of > StreamingDeduplicateWithinWatermarkExec should work *** FAILED *** (1 second, > 237 milliseconds) > [info] Assert on query failed: Execute: None.get > [info] scala.None$.get(Option.scala:627) > [info] scala.None$.get(Option.scala:626) > [info] > org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.<init>(statefulOperators.scala:1101) > [info] > org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.copy(statefulOperators.scala:1092) > [info] > org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1148) > [info] > org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1087) > [info] > org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1210) > [info] > org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1208) > [info] > org.apache.spark.sql.execution.streaming.BaseStreamingDeduplicateExec.withNewChildrenInternal(statefulOperators.scala:949) > [info] > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:323) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46765) make `shuffle` specify the datatype of `seed`
Ruifeng Zheng created SPARK-46765: - Summary: make `shuffle` specify the datatype of `seed` Key: SPARK-46765 URL: https://issues.apache.org/jira/browse/SPARK-46765 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46764) Reorganize Ruby script to build API docs
[ https://issues.apache.org/jira/browse/SPARK-46764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46764: --- Labels: pull-request-available (was: ) > Reorganize Ruby script to build API docs > > > Key: SPARK-46764 > URL: https://issues.apache.org/jira/browse/SPARK-46764 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46764) Reorganize Ruby script to build API docs
Nicholas Chammas created SPARK-46764: Summary: Reorganize Ruby script to build API docs Key: SPARK-46764 URL: https://issues.apache.org/jira/browse/SPARK-46764 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Nicholas Chammas -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46763) ReplaceDeduplicateWithAggregate fails when non-grouping keys have duplicate attributes
Nikhil Sheoran created SPARK-46763: -- Summary: ReplaceDeduplicateWithAggregate fails when non-grouping keys have duplicate attributes Key: SPARK-46763 URL: https://issues.apache.org/jira/browse/SPARK-46763 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 3.5.0 Reporter: Nikhil Sheoran -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808393#comment-17808393 ] Rob Russo commented on SPARK-45282: --- Is it possible that this also affects spark 3.3.2? I have an application that has been running on spark 3.3.2 with AQE enabled. When I upgraded to 3.5.0 I immediately ran into the issue in this ticket. However, when I started looking more closely, I found that for one particular type of report the issue was still present even after rolling back to 3.3.2 with AQE enabled. Either way, on 3.3.2 or 3.5.0, disabling AQE fixed the problem. > Join loses records for cached datasets > -- > > Key: SPARK-45282 > URL: https://issues.apache.org/jira/browse/SPARK-45282 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 > Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or > databricks 13.3 >Reporter: koert kuipers >Assignee: Emil Ejbyfeldt >Priority: Blocker > Labels: CorrectnessBug, correctness, pull-request-available > Fix For: 3.4.2 > > > We observed this issue on Spark 3.4.1, but it is also present on 3.5.0. It is > not present on Spark 3.3.1. > It only shows up in a distributed environment; I cannot replicate it in a unit test. > However, I did get it to show up on a Hadoop cluster, Kubernetes, and on > Databricks 13.3. > The issue is that records are dropped when two cached dataframes are joined. > It seems that in Spark 3.4.1 some Exchanges are dropped from the query plan as an > optimization, while in Spark 3.3.1 these Exchanges are still present. It seems > to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. > To reproduce on a distributed cluster, these settings are needed: > {code:java} > spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 > spark.sql.adaptive.coalescePartitions.parallelismFirst false > spark.sql.adaptive.enabled true > spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} > Code using Scala to reproduce: > {code:java} > import java.util.UUID > import org.apache.spark.sql.functions.col > import spark.implicits._ > val data = (1 to 1000000).toDS().map(i => > UUID.randomUUID().toString).persist() > val left = data.map(k => (k, 1)) > val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! > println("number of left " + left.count()) > println("number of right " + right.count()) > println("number of (left join right) " + > left.toDF("key", "value1").join(right.toDF("key", "value2"), "key").count() > ) > val left1 = left > .toDF("key", "value1") > .repartition(col("key")) // comment out this line to make it work > .persist() > println("number of left1 " + left1.count()) > val right1 = right > .toDF("key", "value2") > .repartition(col("key")) // comment out this line to make it work > .persist() > println("number of right1 " + right1.count()) > println("number of (left1 join right1) " + left1.join(right1, > "key").count()) // this gives incorrect result{code} > This produces the following output: > {code:java} > number of left 1000000 > number of right 1000000 > number of (left join right) 1000000 > number of left1 1000000 > number of right1 1000000 > number of (left1 join right1) 859531 {code} > Note that the last number (the incorrect one) actually varies depending on > settings, cluster size, etc. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46762) Spark Connect 3.5 Classloading issue
[ https://issues.apache.org/jira/browse/SPARK-46762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nirav patel updated SPARK-46762: Description: *Affected version:* spark 3.5 and spark-connect_2.12:3.5.0 *Not affected version and variation:* Spark 3.4 and spark-connect_2.12:3.4.0 Also works with just the Spark 3.5 spark-submit script directly (i.e. without using spark-connect 3.5) We are seeing the following `java.lang.ClassCastException` error in Spark executors when using spark-connect 3.5 with an external Spark SQL catalog jar - iceberg-spark-runtime-3.5_2.12-1.4.3.jar We also set "spark.executor.userClassPathFirst=true" {code:java} pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2): java.lang.ClassCastException: class org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to class org.apache.iceberg.Table (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; org.apache.iceberg.Table is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @4b18b943) at org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88) at org.apache.iceberg.spark.source.RowDataReader.<init>(RowDataReader.java:50) at org.apache.iceberg.spark.source.SparkRowReaderFactory.createReader(SparkRowReaderFactory.java:45) at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84) at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:141) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) at org.apach...{code} We verified that there's only one jar of `iceberg-spark-runtime-3.5_2.12-1.4.3.jar` loaded when the spark-connect server is started. An issue has been opened with Iceberg as well: [https://github.com/apache/iceberg/issues/8978] It is also being discussed in the mail archive: [https://lists.apache.org/thread/5q1pdqqrd1h06hgs8vx9ztt60z5yv8n1] Looking more into the error, it seems the classloader itself is instantiated multiple times somewhere.
I can see two instances: org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053 and org.apache.spark.util.ChildFirstURLClassLoader @4b18b943. Again, this issue doesn't happen with spark-connect 3.4 and doesn't happen when directly using Spark 3.5 without spark-connect 3.5. was: *Affected version:* spark 3.5 and spark-connect_2.12:3.5.0 *Not affected version and variation:* Spark 3.4 and spark-connect_2.12:3.4.0 Also works with just the Spark 3.5 spark-submit script directly (i.e. without using spark-connect 3.5) We are seeing the following `java.lang.ClassCastException` error in Spark executors when using spark-connect 3.5 with an external Spark SQL catalog jar - iceberg-spark-runtime-3.5_2.12-1.4.3.jar {code:java} pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2): java.lang.ClassCastException: class org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to class org.apache.iceberg.Table (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; org.apache.iceberg.Table is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @4b18b943) at org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88) at org.apache.iceberg.spark.source.RowDataReader.<init>(RowDataReader.java:50) at
[jira] [Created] (SPARK-46762) Spark Connect 3.5 Classloading issue
nirav patel created SPARK-46762: --- Summary: Spark Connect 3.5 Classloading issue Key: SPARK-46762 URL: https://issues.apache.org/jira/browse/SPARK-46762 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0 Reporter: nirav patel *Affected version:* spark 3.5 and spark-connect_2.12:3.5.0 *Not affected version and variation:* Spark 3.4 and spark-connect_2.12:3.4.0 Also works with just the Spark 3.5 spark-submit script directly (i.e. without using spark-connect 3.5) We are seeing the following `java.lang.ClassCastException` error in Spark executors when using spark-connect 3.5 with an external Spark SQL catalog jar - iceberg-spark-runtime-3.5_2.12-1.4.3.jar {code:java} pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2): java.lang.ClassCastException: class org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to class org.apache.iceberg.Table (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; org.apache.iceberg.Table is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @4b18b943) at org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88) at org.apache.iceberg.spark.source.RowDataReader.<init>(RowDataReader.java:50) at org.apache.iceberg.spark.source.SparkRowReaderFactory.createReader(SparkRowReaderFactory.java:45) at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84) at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:141) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) at org.apach...{code} We verified that there's only one jar of `iceberg-spark-runtime-3.5_2.12-1.4.3.jar` loaded when the spark-connect server is started. An issue has been opened with Iceberg as well: [https://github.com/apache/iceberg/issues/8978] It is also being discussed in the mail archive: [https://lists.apache.org/thread/5q1pdqqrd1h06hgs8vx9ztt60z5yv8n1] Looking more into this issue, it seems the classloader itself is instantiated multiple times somewhere. I can see two instances: org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46762) Spark Connect 3.5 Classloading issue
[ https://issues.apache.org/jira/browse/SPARK-46762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nirav patel updated SPARK-46762: Description: *Affected version:* spark 3.5 and spark-connect_2.12:3.5.0 *Not affected version and variation:* Spark 3.4 and spark-connect_2.12:3.4.0 Also works with just the Spark 3.5 spark-submit script directly (i.e. without using spark-connect 3.5) We are seeing the following `java.lang.ClassCastException` error in Spark executors when using spark-connect 3.5 with an external Spark SQL catalog jar - iceberg-spark-runtime-3.5_2.12-1.4.3.jar {code:java} pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2): java.lang.ClassCastException: class org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to class org.apache.iceberg.Table (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; org.apache.iceberg.Table is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @4b18b943) at org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88) at org.apache.iceberg.spark.source.RowDataReader.<init>(RowDataReader.java:50) at org.apache.iceberg.spark.source.SparkRowReaderFactory.createReader(SparkRowReaderFactory.java:45) at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84) at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:141) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) at org.apach...{code} We verified that there's only one jar of `iceberg-spark-runtime-3.5_2.12-1.4.3.jar` loaded when the spark-connect server is started. An issue has been opened with Iceberg as well: [https://github.com/apache/iceberg/issues/8978] It is also being discussed in the mail archive: [https://lists.apache.org/thread/5q1pdqqrd1h06hgs8vx9ztt60z5yv8n1] Looking more into this issue, it seems the classloader itself is instantiated multiple times somewhere.
I can see two instances: org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053 and org.apache.spark.util.ChildFirstURLClassLoader @4b18b943. was: *Affected version:* spark 3.5 and spark-connect_2.12:3.5.0 *Not affected version and variation:* Spark 3.4 and spark-connect_2.12:3.4.0 Also works with just the Spark 3.5 spark-submit script directly (i.e. without using spark-connect 3.5) We are seeing the following `java.lang.ClassCastException` error in Spark executors when using spark-connect 3.5 with an external Spark SQL catalog jar - iceberg-spark-runtime-3.5_2.12-1.4.3.jar {code:java} pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2): java.lang.ClassCastException: class org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to class org.apache.iceberg.Table (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; org.apache.iceberg.Table is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @4b18b943) at org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88) at org.apache.iceberg.spark.source.RowDataReader.<init>(RowDataReader.java:50) at org.apache.iceberg.spark.source.SparkRowReaderFactory.createReader(SparkRowReaderFactory.java:45) at
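For readers unfamiliar with the failure mode in this ticket: a class loaded by two different classloader instances yields two distinct runtime classes, so a cast between them throws ClassCastException even though the fully qualified names match. A self-contained illustration with a hypothetical jar and class name (not Spark's actual loader wiring):
{code:java}
import java.net.{URL, URLClassLoader}

// Hypothetical jar containing com.example.Table.
val jarUrl = new URL("file:/tmp/example.jar")

// Two independent, isolated loaders over the same jar (parent = null).
val loaderA = new URLClassLoader(Array(jarUrl), null)
val loaderB = new URLClassLoader(Array(jarUrl), null)

val clsA = loaderA.loadClass("com.example.Table")
val clsB = loaderB.loadClass("com.example.Table")

// Same name, different Class objects: a cast from one to the other fails
// exactly like the SerializableTableWithSize -> Table cast in the trace above.
println(clsA == clsB)                // false
println(clsA.isAssignableFrom(clsB)) // false
{code}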
[jira] [Created] (SPARK-46761) quoted strings in a JSON path should support ? characters
Robert Joseph Evans created SPARK-46761: --- Summary: quoted strings in a JSON path should support ? characters Key: SPARK-46761 URL: https://issues.apache.org/jira/browse/SPARK-46761 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 4.0.0 Reporter: Robert Joseph Evans I think this impacts all versions of Spark after SPARK-18677, which made the operator work at all in 2.1.0/2.0.3. It comes down to {code:java} name <- '.' ~> "[^\\.\\[]+".r | "['" ~> "[^\\'\\?]+".r <~ "']"{code} [https://github.com/apache/spark/blob/01bb1b1a3dbfc68f41d9b13de863d26d587c7e2f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L79] The regular expression/pattern says that we want a [' followed by one or more characters that are not a single quote ' or a question mark ?, followed by ']. That question mark looks out of place. When I try to put a question mark in a quoted string, it fails to produce any result, but when I put the same data/path into [https://jsonpath.com/] I get a result. Data: {code:java} {"?":"QUESTION"} {code} Path: {code:java} $['?'] {code} I also see no tests validating that a question mark is not allowed, so I suspect that it is a long-standing bug. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
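To reproduce in SQL: per the report above, the quoted-name branch of the JSON path parser rejects `?`, so the following returns NULL on affected versions (a minimal repro sketch; the expected value is what https://jsonpath.com/ returns):
{code:java}
spark.sql("""SELECT get_json_object('{"?":"QUESTION"}', "$['?']")""").show()
// Affected versions: NULL, because the path $['?'] fails to parse.
// Expected: QUESTION
{code}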
[jira] [Resolved] (SPARK-46759) Codec xz and zstandard support compression level for avro files
[ https://issues.apache.org/jira/browse/SPARK-46759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46759. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44786 [https://github.com/apache/spark/pull/44786] > Codec xz and zstandard support compression level for avro files > --- > > Key: SPARK-46759 > URL: https://issues.apache.org/jira/browse/SPARK-46759 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46759) Codec xz and zstandard support compression level for avro files
[ https://issues.apache.org/jira/browse/SPARK-46759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46759: - Assignee: Kent Yao > Codec xz and zstandard support compression level for avro files > --- > > Key: SPARK-46759 > URL: https://issues.apache.org/jira/browse/SPARK-46759 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
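For context, a sketch of how the new knobs for this ticket would be used from a session. The codec config is the long-standing `spark.sql.avro.compression.codec`; the per-codec level name below is an assumption that mirrors the existing `spark.sql.avro.deflate.level` and should be checked against the linked PR:
{code:java}
// Choose the Avro codec for writes.
spark.conf.set("spark.sql.avro.compression.codec", "zstandard")
// Assumed level config, following the spark.sql.avro.deflate.level naming.
spark.conf.set("spark.sql.avro.zstandard.level", "3")

// Requires the spark-avro package on the classpath.
spark.range(1000).write.format("avro").save("/tmp/avro_zstd_demo")
{code}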
[jira] [Commented] (SPARK-46247) Invalid bucket file error when reading from bucketed table created with PathOutputCommitProtocol
[ https://issues.apache.org/jira/browse/SPARK-46247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808235#comment-17808235 ] Никита Соколов commented on SPARK-46247: No, there is no trailing dot at the end of the filenames; it comes from the exception message. The file is invalid because of the -5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet suffix. BucketingUtils fails to extract the bucket id when it is there. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BucketingUtils.scala#L34C31-L34C31] Is this enough? If not, I will come back with the whole stacktrace a bit later. Should I use the s3a prefix in the path option or in some configurations? > Invalid bucket file error when reading from bucketed table created with > PathOutputCommitProtocol > > > Key: SPARK-46247 > URL: https://issues.apache.org/jira/browse/SPARK-46247 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Никита Соколов >Priority: Major > > I am trying to create an external partitioned bucketed table using this code: > {code:java} > spark.read.parquet("s3://faucct/input") > .repartition(128, col("product_id")) > .write.partitionBy("features_date").bucketBy(128, "product_id") > .option("path", "s3://faucct/tmp/output") > .option("compression", "uncompressed") > .saveAsTable("tmp.output"){code} > At first it took more time than expected because it had to rename a lot of > files in the end, which requires copying in S3. But I have used the > configuration from the documentation – > [https://spark.apache.org/docs/3.0.0-preview/cloud-integration.html#committing-work-into-cloud-storage-safely-and-fast]: > {code:java} > spark.hadoop.fs.s3a.committer.name directory > spark.sql.sources.commitProtocolClass > org.apache.spark.internal.io.cloud.PathOutputCommitProtocol > spark.sql.parquet.output.committer.class > org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter {code} > It is properly partitioned: every partition_date has exactly 128 files named > like > [part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet|https://s3.console.aws.amazon.com/s3/object/joom-analytics-recom?region=eu-central-1=recom/dataset/best/best-to-cart-rt/user-product-v4/to_cart-faucct/fnw/ipw/msv2/2023-09-15/14d/tmp_3/features_date%3D2023-09-01/part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet]. > Then I am trying to join this table with another one, for example like this: > {code:java} > spark.table("tmp.output").repartition(128, $"product_id") > .join(spark.table("tmp.output").repartition(128, $"product_id"), > Seq("product_id")).count(){code} > Because of the configuration I get the following errors: > {code:java} > org.apache.spark.SparkException: [INVALID_BUCKET_FILE] Invalid bucket file: > s3://faucct/tmp/output/features_date=2023-09-01/part-00000-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00000-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.invalidBucketFile(QueryExecutionErrors.scala:2731) > at > org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$createBucketedReadRDD$5(DataSourceScanExec.scala:636) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
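For reference, the bucket-id extraction the comment points to is a single regex match over the file name. The sketch below copies the pattern from the linked BucketingUtils line (worth double-checking against the exact Spark version) and shows why the committer-generated suffix breaks it:
{code:java}
// Pattern as in BucketingUtils.getBucketId (see the link above).
val bucketedFileName = """.*_(\d+)(?:\..*)?$""".r

def getBucketId(fileName: String): Option[Int] = fileName match {
  case bucketedFileName(bucketId) => Some(bucketId.toInt)
  case _ => None
}

// Regular bucketed name: the _00117 bucket suffix is followed by ".c000...", so it matches.
println(getBucketId("part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117.c000.parquet"))  // Some(117)

// Name produced with PathOutputCommitProtocol: the bucket digits are followed by
// "-5eb66a54-...", not "." or end-of-string, so the match fails and the id is lost.
println(getBucketId("part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet"))  // None
{code}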
[jira] [Commented] (SPARK-46247) Invalid bucket file error when reading from bucketed table created with PathOutputCommitProtocol
[ https://issues.apache.org/jira/browse/SPARK-46247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808227#comment-17808227 ] Steve Loughran commented on SPARK-46247: Why is the file invalid? Any more stack trace? # Try using s3a:// as the prefix all the way through. # Is there really a "." at the end of the filenames? The directory committer was Netflix's design for incremental update of an existing table, where a partition could be deleted before new data was committed. Unless you want to do this, use the magic or (second best) staging committer. > Invalid bucket file error when reading from bucketed table created with > PathOutputCommitProtocol > > > Key: SPARK-46247 > URL: https://issues.apache.org/jira/browse/SPARK-46247 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Никита Соколов >Priority: Major > > I am trying to create an external partitioned bucketed table using this code: > {code:java} > spark.read.parquet("s3://faucct/input") > .repartition(128, col("product_id")) > .write.partitionBy("features_date").bucketBy(128, "product_id") > .option("path", "s3://faucct/tmp/output") > .option("compression", "uncompressed") > .saveAsTable("tmp.output"){code} > At first it took more time than expected because it had to rename a lot of > files in the end, which requires copying in S3. But I have used the > configuration from the documentation – > [https://spark.apache.org/docs/3.0.0-preview/cloud-integration.html#committing-work-into-cloud-storage-safely-and-fast]: > {code:java} > spark.hadoop.fs.s3a.committer.name directory > spark.sql.sources.commitProtocolClass > org.apache.spark.internal.io.cloud.PathOutputCommitProtocol > spark.sql.parquet.output.committer.class > org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter {code} > It is properly partitioned: every partition_date has exactly 128 files named > like > [part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet|https://s3.console.aws.amazon.com/s3/object/joom-analytics-recom?region=eu-central-1=recom/dataset/best/best-to-cart-rt/user-product-v4/to_cart-faucct/fnw/ipw/msv2/2023-09-15/14d/tmp_3/features_date%3D2023-09-01/part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet]. > Then I am trying to join this table with another one, for example like this: > {code:java} > spark.table("tmp.output").repartition(128, $"product_id") > .join(spark.table("tmp.output").repartition(128, $"product_id"), > Seq("product_id")).count(){code} > Because of the configuration I get the following errors: > {code:java} > org.apache.spark.SparkException: [INVALID_BUCKET_FILE] Invalid bucket file: > s3://faucct/tmp/output/features_date=2023-09-01/part-00000-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00000-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.invalidBucketFile(QueryExecutionErrors.scala:2731) > at > org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$createBucketedReadRDD$5(DataSourceScanExec.scala:636) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46759) Codec xz and zstandard support compression level for avro files
[ https://issues.apache.org/jira/browse/SPARK-46759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46759: --- Labels: pull-request-available (was: ) > Codec xz and zstandard support compression level for avro files > --- > > Key: SPARK-46759 > URL: https://issues.apache.org/jira/browse/SPARK-46759 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46760) Make the document of spark.sql.adaptive.coalescePartitions.parallelismFirst clearer
[ https://issues.apache.org/jira/browse/SPARK-46760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46760: --- Labels: pull-request-available (was: ) > Make the document of spark.sql.adaptive.coalescePartitions.parallelismFirst > clearer > --- > > Key: SPARK-46760 > URL: https://issues.apache.org/jira/browse/SPARK-46760 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46760) Make the document of spark.sql.adaptive.coalescePartitions.parallelismFirst clearer
Jiaan Geng created SPARK-46760: -- Summary: Make the document of spark.sql.adaptive.coalescePartitions.parallelismFirst clearer Key: SPARK-46760 URL: https://issues.apache.org/jira/browse/SPARK-46760 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Jiaan Geng Assignee: Jiaan Geng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46759) Codec xz and zstandard support compression level for avro files
Kent Yao created SPARK-46759: Summary: Codec xz and zstandard support compression level for avro files Key: SPARK-46759 URL: https://issues.apache.org/jira/browse/SPARK-46759 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39910) DataFrameReader API cannot read files from hadoop archives (.har)
[ https://issues.apache.org/jira/browse/SPARK-39910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-39910: -- Assignee: Apache Spark > DataFrameReader API cannot read files from hadoop archives (.har) > - > > Key: SPARK-39910 > URL: https://issues.apache.org/jira/browse/SPARK-39910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2 >Reporter: Christophe Préaud >Assignee: Apache Spark >Priority: Minor > Labels: DataFrameReader, pull-request-available > > Reading a file from a Hadoop archive using the DataFrameReader API returns > an empty Dataset: > {code:java} > scala> val df = > spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719") > df: org.apache.spark.sql.Dataset[String] = [value: string] > scala> df.count > res7: Long = 0 {code} > > On the other hand, reading the same file, from the same Hadoop archive, but > using the RDD API yields the correct result: > {code:java} > scala> val df = > sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value") > df: org.apache.spark.sql.DataFrame = [value: string] > scala> df.count > res8: Long = 5589 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39910) DataFrameReader API cannot read files from hadoop archives (.har)
[ https://issues.apache.org/jira/browse/SPARK-39910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-39910: -- Assignee: (was: Apache Spark) > DataFrameReader API cannot read files from hadoop archives (.har) > - > > Key: SPARK-39910 > URL: https://issues.apache.org/jira/browse/SPARK-39910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2 >Reporter: Christophe Préaud >Priority: Minor > Labels: DataFrameReader, pull-request-available > > Reading a file from a Hadoop archive using the DataFrameReader API returns > an empty Dataset: > {code:java} > scala> val df = > spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719") > df: org.apache.spark.sql.Dataset[String] = [value: string] > scala> df.count > res7: Long = 0 {code} > > On the other hand, reading the same file, from the same Hadoop archive, but > using the RDD API yields the correct result: > {code:java} > scala> val df = > sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value") > df: org.apache.spark.sql.DataFrame = [value: string] > scala> df.count > res8: Long = 5589 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46623) Replace SimpleDateFormat with DateTimeFormatter
[ https://issues.apache.org/jira/browse/SPARK-46623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808080#comment-17808080 ] Mridul Muralidharan commented on SPARK-46623: - Issue resolved by pull request 44616 https://github.com/apache/spark/pull/44616 > Replace SimpleDateFormat with DateTimeFormatter > --- > > Key: SPARK-46623 > URL: https://issues.apache.org/jira/browse/SPARK-46623 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46623) Replace SimpleDateFormat with DateTimeFormatter
[ https://issues.apache.org/jira/browse/SPARK-46623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-46623. - Fix Version/s: 4.0.0 Assignee: Jiaan Geng Resolution: Fixed > Replace SimpleDateFormat with DateTimeFormatter > --- > > Key: SPARK-46623 > URL: https://issues.apache.org/jira/browse/SPARK-46623 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
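Background on why this replacement is worthwhile: SimpleDateFormat is mutable and not thread-safe, so instances cannot be shared across threads, while DateTimeFormatter is immutable and safely shareable. A minimal before/after sketch, not the specific call sites touched by the PR:
{code:java}
import java.text.SimpleDateFormat
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// Before: not thread-safe; each thread needs its own instance.
val legacy = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
println(legacy.format(new java.util.Date(0L)))

// After: immutable and thread-safe; can be a shared constant.
val modern = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneId.of("UTC"))
println(modern.format(Instant.ofEpochMilli(0L)))  // 1970-01-01 00:00:00
{code}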
[jira] [Resolved] (SPARK-46696) In ResourceProfileManager, function calls should occur after variable declarations.
[ https://issues.apache.org/jira/browse/SPARK-46696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-46696. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44705 [https://github.com/apache/spark/pull/44705] > In ResourceProfileManager, function calls should occur after variable > declarations. > --- > > Key: SPARK-46696 > URL: https://issues.apache.org/jira/browse/SPARK-46696 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: liangyongyuan >Assignee: liangyongyuan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > As the title suggests, in *ResourceProfileManager*, function calls should be > made after variable declarations. When determining *isSupport*, all variables > are uninitialized, with booleans defaulting to false and objects to null. > While the end result is correct, the evaluation process is abnormal. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46696) In ResourceProfileManager, function calls should occur after variable declarations.
[ https://issues.apache.org/jira/browse/SPARK-46696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-46696: --- Assignee: liangyongyuan > In ResourceProfileManager, function calls should occur after variable > declarations. > --- > > Key: SPARK-46696 > URL: https://issues.apache.org/jira/browse/SPARK-46696 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: liangyongyuan >Assignee: liangyongyuan >Priority: Major > Labels: pull-request-available > > As the title suggests, in *ResourceProfileManager*, function calls should be > made after variable declarations. When determining *isSupport*, all variables > are uninitialized, with booleans defaulting to false and objects to null. > While the end result is correct, the evaluation process is abnormal. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
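The pitfall this ticket fixes is the usual JVM/Scala initialization-order trap: a method invoked before the fields it reads are declared sees their default values (false for booleans, null for references). A self-contained illustration, not the actual ResourceProfileManager code:
{code:java}
class Manager {
  // Function call placed BEFORE the variable declarations it depends on.
  val supported: Boolean = isSupport()

  val enabled: Boolean = true
  val mode: String = "local"

  // When called during the initialization of `supported`, `enabled` is
  // still false and `mode` is still null.
  def isSupport(): Boolean = enabled && mode != null
}

// Prints false even though enabled is declared as true.
println(new Manager().supported)
{code}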
[jira] [Resolved] (SPARK-46754) Fix compression code resolution in avro table definition
[ https://issues.apache.org/jira/browse/SPARK-46754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-46754. -- Fix Version/s: 4.0.0 Assignee: Kent Yao Resolution: Fixed resolved by [GitHub Pull Request #44780|https://github.com/apache/spark/pull/44780] > Fix compression code resolution in avro table definition > > > Key: SPARK-46754 > URL: https://issues.apache.org/jira/browse/SPARK-46754 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > one fix for case insensitivity and the other for correctly handling invalid > codec names -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
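A sketch of the two fixes as described in this ticket (case-insensitive codec lookup, plus an explicit error for unknown names); this is illustrative, not the actual patch:
{code:java}
import java.util.Locale

// Avro codec names supported by Spark, in canonical lowercase form.
val knownCodecs = Set("uncompressed", "deflate", "snappy", "bzip2", "xz", "zstandard")

def resolveCodec(name: String): String = {
  val canonical = name.toLowerCase(Locale.ROOT)
  if (knownCodecs.contains(canonical)) canonical
  else throw new IllegalArgumentException(s"Invalid Avro compression codec: $name")
}

println(resolveCodec("ZSTANDARD"))  // zstandard: case does not matter
// resolveCodec("zzz") throws IllegalArgumentException instead of failing later
{code}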
[jira] [Updated] (SPARK-46708) Support error message format in Spark Connect service
[ https://issues.apache.org/jira/browse/SPARK-46708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46708: --- Labels: pull-request-available (was: ) > Support error message format in Spark Connect service > - > > Key: SPARK-46708 > URL: https://issues.apache.org/jira/browse/SPARK-46708 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 4.0.0 >Reporter: Garland Zhang >Priority: Major > Labels: pull-request-available > > * spark connect does not properly support {{spark.sql.error.messageFormat}} > which means spark connect exception messages don't change based on the > format. > * we need to add this parity to spark connect -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
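For reference, the classic (non-Connect) behavior this ticket asks to match: `spark.sql.error.messageFormat` accepts PRETTY (the default), MINIMAL, or STANDARD, and the latter two render SparkThrowable messages as JSON with the error class and parameters. A hedged sketch of checking the behavior, assuming a session where the conf is settable at runtime:
{code:java}
spark.conf.set("spark.sql.error.messageFormat", "MINIMAL")

try {
  spark.sql("selec 1")  // deliberate parse error
} catch {
  case e: Exception =>
    // With MINIMAL/STANDARD this message is JSON (error class + parameters);
    // per this ticket, over Spark Connect it does not change with the format.
    println(e.getMessage)
}
{code}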