[jira] [Updated] (SPARK-46769) Fix inferring of TIMESTAMP_NTZ in CSV/JSON

2024-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46769:
---
Labels: pull-request-available  (was: )

> Fix inferring of TIMESTAMP_NTZ in CSV/JSON
> --
>
> Key: SPARK-46769
> URL: https://issues.apache.org/jira/browse/SPARK-46769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
>
> After the PR https://github.com/apache/spark/pull/43243, TIMESTAMP_NTZ type 
> inference in the CSV/JSON datasources got two new guards, which means 
> TIMESTAMP_NTZ is inferred only if:
> 1. the SQL config `spark.sql.legacy.timeParserPolicy` is set to `LEGACY`, or
> 2. `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`;
> otherwise CSV/JSON tries to infer `TIMESTAMP_LTZ`.
> Both guards are unnecessary because:
> 1. when `spark.sql.legacy.timeParserPolicy` is `LEGACY`, that only means Spark 
> should use a legacy pre-Java 8 parser: `FastDateFormat` or `SimpleDateFormat`. 
> Both parsers can parse `TIMESTAMP_NTZ` values.
> 2. when `spark.sql.timestampType` is set to `TIMESTAMP_LTZ`, that doesn't mean 
> we should skip inferring `TIMESTAMP_NTZ` types in CSV/JSON and instead try to 
> parse a timestamp string without a time zone, like `2024-01-19T09:10:11.123`, 
> using an LTZ format **with** a time zone, like `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`. 
> _The latter cannot match any NTZ value._



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46769) Fix inferring of TIMESTAMP_NTZ in CSV/JSON

2024-01-18 Thread Max Gekk (Jira)
Max Gekk created SPARK-46769:


 Summary: Fix inferring of TIMESTAMP_NTZ in CSV/JSON
 Key: SPARK-46769
 URL: https://issues.apache.org/jira/browse/SPARK-46769
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Max Gekk
Assignee: Max Gekk


After the PR https://github.com/apache/spark/pull/43243, TIMESTAMP_NTZ type 
inference in the CSV/JSON datasources got two new guards, which means 
TIMESTAMP_NTZ is inferred only if:

1. the SQL config `spark.sql.legacy.timeParserPolicy` is set to `LEGACY`, or
2. `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`;

otherwise CSV/JSON tries to infer `TIMESTAMP_LTZ`.

Both guards are unnecessary because:

1. when `spark.sql.legacy.timeParserPolicy` is `LEGACY`, that only means Spark 
should use a legacy pre-Java 8 parser: `FastDateFormat` or `SimpleDateFormat`. 
Both parsers can parse `TIMESTAMP_NTZ` values.
2. when `spark.sql.timestampType` is set to `TIMESTAMP_LTZ`, that doesn't mean 
we should skip inferring `TIMESTAMP_NTZ` types in CSV/JSON and instead try to 
parse a timestamp string without a time zone, like `2024-01-19T09:10:11.123`, 
using an LTZ format **with** a time zone, like `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`. 
_The latter cannot match any NTZ value._
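For illustration, a minimal sketch (assuming a spark-shell session where `spark` and its implicits are available; the schema shown is the behavior this fix expects, not verified output):

{code:java}
// Sketch only: a timestamp string without a time zone should be inferred as
// TIMESTAMP_NTZ when spark.sql.timestampType = TIMESTAMP_NTZ; with TIMESTAMP_LTZ
// the CSV/JSON inference falls back to TIMESTAMP_LTZ.
// Note: the JSON reader needs inferTimestamp=true to infer timestamps at all.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")

val jsonLines = Seq("""{"ts": "2024-01-19T09:10:11.123"}""").toDS()
spark.read.option("inferTimestamp", "true").json(jsonLines).printSchema()
// expected:
// root
//  |-- ts: timestamp_ntz (nullable = true)
{code}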



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46765) make `shuffle` specify the datatype of `seed`

2024-01-18 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-46765.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44793
[https://github.com/apache/spark/pull/44793]

> make `shuffle` specify the datatype of `seed`
> -
>
> Key: SPARK-46765
> URL: https://issues.apache.org/jira/browse/SPARK-46765
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46765) make `shuffle` specify the datatype of `seed`

2024-01-18 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-46765:
-

Assignee: Ruifeng Zheng

> make `shuffle` specify the datatype of `seed`
> -
>
> Key: SPARK-46765
> URL: https://issues.apache.org/jira/browse/SPARK-46765
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46768) Upgrade the Guava version used by the connect module to 33.0-jre

2024-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46768:
---
Labels: pull-request-available  (was: )

> Upgrade the Guava version used by the connect module to 33.0-jre
> 
>
> Key: SPARK-46768
> URL: https://issues.apache.org/jira/browse/SPARK-46768
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46768) Upgrade the Guava version used by the connect module to 33.0-jre

2024-01-18 Thread Yang Jie (Jira)
Yang Jie created SPARK-46768:


 Summary: Upgrade the Guava version used by the connect module to 
33.0-jre
 Key: SPARK-46768
 URL: https://issues.apache.org/jira/browse/SPARK-46768
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46767) Refine docstring of `abs/acos/acosh`

2024-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46767:
---
Labels: pull-request-available  (was: )

> Refine docstring of `abs/acos/acosh`
> 
>
> Key: SPARK-46767
> URL: https://issues.apache.org/jira/browse/SPARK-46767
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46767) Refine docstring of `abs/acos/acosh`

2024-01-18 Thread Yang Jie (Jira)
Yang Jie created SPARK-46767:


 Summary: Refine docstring of `abs/acos/acosh`
 Key: SPARK-46767
 URL: https://issues.apache.org/jira/browse/SPARK-46767
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46765) make `shuffle` specify the datatype of `seed`

2024-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46765:
---
Labels: pull-request-available  (was: )

> make `shuffle` specify the datatype of `seed`
> -
>
> Key: SPARK-46765
> URL: https://issues.apache.org/jira/browse/SPARK-46765
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46765) make `shuffle` specify the datatype of `seed`

2024-01-18 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-46765:
--
Summary: make `shuffle` specify the datatype of `seed`  (was: Support 
upcasting for unregistered functions)

> make `shuffle` specify the datatype of `seed`
> -
>
> Key: SPARK-46765
> URL: https://issues.apache.org/jira/browse/SPARK-46765
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46765) Support upcasting for unregistered functions

2024-01-18 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-46765:
--
Priority: Major  (was: Minor)

> Support upcasting for unregistered functions
> 
>
> Key: SPARK-46765
> URL: https://issues.apache.org/jira/browse/SPARK-46765
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46765) Support upcasting for unregistered functions

2024-01-18 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-46765:
--
Summary: Support upcasting for unregistered functions  (was: make `shuffle` 
specify the datatype of `seed`)

> Support upcasting for unregistered functions
> 
>
> Key: SPARK-46765
> URL: https://issues.apache.org/jira/browse/SPARK-46765
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46766) ZSTD Buffer Pool Support For AVRO datasource

2024-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46766:
---
Labels: pull-request-available  (was: )

> ZSTD Buffer Pool Support For AVRO datasource
> 
>
> Key: SPARK-46766
> URL: https://issues.apache.org/jira/browse/SPARK-46766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46766) ZSTD Buffer Pool Support For AVRO datasource

2024-01-18 Thread Kent Yao (Jira)
Kent Yao created SPARK-46766:


 Summary: ZSTD Buffer Pool Support For AVRO datasource
 Key: SPARK-46766
 URL: https://issues.apache.org/jira/browse/SPARK-46766
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan

2024-01-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-46676:


Assignee: Jungtaek Lim

> dropDuplicatesWithinWatermark throws error on canonicalizing plan
> -
>
> Key: SPARK-46676
> URL: https://issues.apache.org/jira/browse/SPARK-46676
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>  Labels: pull-request-available
>
> Simply said, this test code fails:
> {code:java}
> test("SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work") {
>   withTempDir { checkpoint =>
> val dedupeInputData = MemoryStream[(String, Int)]
> val dedupe = dedupeInputData.toDS()
>   .withColumn("eventTime", timestamp_seconds($"_2"))
>   .withWatermark("eventTime", "10 second")
>   .dropDuplicatesWithinWatermark("_1")
>   .select($"_1", $"eventTime".cast("long").as[Long])
> testStream(dedupe, Append)(
>   StartStream(checkpointLocation = checkpoint.getCanonicalPath),
>   AddData(dedupeInputData, "a" -> 1),
>   CheckNewAnswer("a" -> 1),
>   Execute { q =>
> // This threw out error!
> q.lastExecution.executedPlan.canonicalized
>   }
> )
>   }
> } {code}
> with below error:
> {code:java}
> [info] - SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work *** FAILED *** (1 second, 
> 237 milliseconds)
> [info]   Assert on query failed: Execute: None.get
> [info]   scala.None$.get(Option.scala:627)
> [info]       scala.None$.get(Option.scala:626)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.<init>(statefulOperators.scala:1101)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.copy(statefulOperators.scala:1092)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1148)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1087)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1210)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1208)
> [info]       
> org.apache.spark.sql.execution.streaming.BaseStreamingDeduplicateExec.withNewChildrenInternal(statefulOperators.scala:949)
> [info]       
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:323)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan

2024-01-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-46676.
--
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 44688
[https://github.com/apache/spark/pull/44688]

> dropDuplicatesWithinWatermark throws error on canonicalizing plan
> -
>
> Key: SPARK-46676
> URL: https://issues.apache.org/jira/browse/SPARK-46676
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> Simply said, this test code fails:
> {code:java}
> test("SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work") {
>   withTempDir { checkpoint =>
> val dedupeInputData = MemoryStream[(String, Int)]
> val dedupe = dedupeInputData.toDS()
>   .withColumn("eventTime", timestamp_seconds($"_2"))
>   .withWatermark("eventTime", "10 second")
>   .dropDuplicatesWithinWatermark("_1")
>   .select($"_1", $"eventTime".cast("long").as[Long])
> testStream(dedupe, Append)(
>   StartStream(checkpointLocation = checkpoint.getCanonicalPath),
>   AddData(dedupeInputData, "a" -> 1),
>   CheckNewAnswer("a" -> 1),
>   Execute { q =>
> // This threw out error!
> q.lastExecution.executedPlan.canonicalized
>   }
> )
>   }
> } {code}
> with below error:
> {code:java}
> [info] - SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work *** FAILED *** (1 second, 
> 237 milliseconds)
> [info]   Assert on query failed: Execute: None.get
> [info]   scala.None$.get(Option.scala:627)
> [info]       scala.None$.get(Option.scala:626)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.<init>(statefulOperators.scala:1101)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.copy(statefulOperators.scala:1092)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1148)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1087)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1210)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1208)
> [info]       
> org.apache.spark.sql.execution.streaming.BaseStreamingDeduplicateExec.withNewChildrenInternal(statefulOperators.scala:949)
> [info]       
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:323)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46765) make `shuffle` specify the datatype of `seed`

2024-01-18 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-46765:
-

 Summary: make `shuffle` specify the datatype of `seed`
 Key: SPARK-46765
 URL: https://issues.apache.org/jira/browse/SPARK-46765
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46764) Reorganize Ruby script to build API docs

2024-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46764:
---
Labels: pull-request-available  (was: )

> Reorganize Ruby script to build API docs
> 
>
> Key: SPARK-46764
> URL: https://issues.apache.org/jira/browse/SPARK-46764
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46764) Reorganize Ruby script to build API docs

2024-01-18 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46764:


 Summary: Reorganize Ruby script to build API docs
 Key: SPARK-46764
 URL: https://issues.apache.org/jira/browse/SPARK-46764
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46763) ReplaceDeduplicateWithAggregate fails when non-grouping keys have duplicate attributes

2024-01-18 Thread Nikhil Sheoran (Jira)
Nikhil Sheoran created SPARK-46763:
--

 Summary: ReplaceDeduplicateWithAggregate fails when non-grouping 
keys have duplicate attributes
 Key: SPARK-46763
 URL: https://issues.apache.org/jira/browse/SPARK-46763
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.5.0
Reporter: Nikhil Sheoran






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45282) Join loses records for cached datasets

2024-01-18 Thread Rob Russo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808393#comment-17808393
 ] 

Rob Russo commented on SPARK-45282:
---

Is it possible that this also affects Spark 3.3.2? I have an application that 
has been running on Spark 3.3.2 with AQE enabled. When I upgraded to 3.5.0 I 
immediately ran into the issue in this ticket. However, when I started looking 
more closely, I found that for one particular type of report the issue was still 
present even after rolling back to 3.3.2 with AQE enabled.

Either way, on 3.3.2 or 3.5.0, disabling AQE fixed the problem.

> Join loses records for cached datasets
> --
>
> Key: SPARK-45282
> URL: https://issues.apache.org/jira/browse/SPARK-45282
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
> Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or 
> databricks 13.3
>Reporter: koert kuipers
>Assignee: Emil Ejbyfeldt
>Priority: Blocker
>  Labels: CorrectnessBug, correctness, pull-request-available
> Fix For: 3.4.2
>
>
> We observed this issue on Spark 3.4.1, but it is also present on 3.5.0. It is 
> not present on Spark 3.3.1.
> It only shows up in a distributed environment; I cannot replicate it in a unit 
> test. However, I did get it to show up on a Hadoop cluster, on Kubernetes, and 
> on Databricks 13.3.
> The issue is that records are dropped when two cached dataframes are joined. 
> It seems that in Spark 3.4.1 some Exchanges are dropped from the query plan as 
> an optimization, while in Spark 3.3.1 these Exchanges are still present. It 
> seems to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true.
> to reproduce on distributed cluster these settings needed are:
> {code:java}
> spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432
> spark.sql.adaptive.coalescePartitions.parallelismFirst false
> spark.sql.adaptive.enabled true
> spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code}
> code using scala to reproduce is:
> {code:java}
> import java.util.UUID
> import org.apache.spark.sql.functions.col
> import spark.implicits._
> val data = (1 to 100).toDS().map(i => 
> UUID.randomUUID().toString).persist()
> val left = data.map(k => (k, 1))
> val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works!
> println("number of left " + left.count())
> println("number of right " + right.count())
> println("number of (left join right) " +
>   left.toDF("key", "value1").join(right.toDF("key", "value2"), "key").count()
> )
> val left1 = left
>   .toDF("key", "value1")
>   .repartition(col("key")) // comment out this line to make it work
>   .persist()
> println("number of left1 " + left1.count())
> val right1 = right
>   .toDF("key", "value2")
>   .repartition(col("key")) // comment out this line to make it work
>   .persist()
> println("number of right1 " + right1.count())
> println("number of (left1 join right1) " +  left1.join(right1, 
> "key").count()) // this gives incorrect result{code}
> this produces the following output:
> {code:java}
> number of left 100
> number of right 100
> number of (left join right) 100
> number of left1 100
> number of right1 100
> number of (left1 join right1) 859531 {code}
> note that the last number (the incorrect one) actually varies depending on 
> settings and cluster size etc.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46762) Spark Connect 3.5 Classloading issue

2024-01-18 Thread nirav patel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nirav patel updated SPARK-46762:

Description: 
*Affected version:*

spark 3.5 and spark-connect_2.12:3.5.0

 

*Not affected version and variation:*

Spark 3.4 and spark-connect_2.12:3.4.0

Also works with just Spark 3.5 spark-submit script directly (ie without using 
spark-connect 3.5)

 

We are getting the following `java.lang.ClassCastException` error in Spark 
executors when using spark-connect 3.5 with an external Spark SQL catalog jar - 
iceberg-spark-runtime-3.5_2.12-1.4.3.jar

We also set "spark.executor.userClassPathFirst=true" 

 
{code:java}
pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
(org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
3) (spark35-m.c.strivr-dev-test.internal executor 2): 
java.lang.ClassCastException: class 
org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to 
class org.apache.iceberg.Table 
(org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed module 
of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; 
org.apache.iceberg.Table is in unnamed module of loader 
org.apache.spark.util.ChildFirstURLClassLoader @4b18b943)
    at 
org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88)
    at 
org.apache.iceberg.spark.source.RowDataReader.<init>(RowDataReader.java:50)
    at 
org.apache.iceberg.spark.source.SparkRowReaderFactory.createReader(SparkRowReaderFactory.java:45)
    at 
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
    at 
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
    at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apach...{code}
 

 

We verified that there's only one jar of 
`iceberg-spark-runtime-3.5_2.12-1.4.3.jar` loaded when spark-connect server is 
started. 

Issue has been open with Iceberg as well: 
[https://github.com/apache/iceberg/issues/8978]

And being discussed in mail archive: 
[https://lists.apache.org/thread/5q1pdqqrd1h06hgs8vx9ztt60z5yv8n1]

 

Looking more into the error, it seems the classloader itself is instantiated 
multiple times somewhere. I can see two instances: 
org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053 and 
org.apache.spark.util.ChildFirstURLClassLoader @4b18b943
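For context, a minimal sketch of why two class-loader instances produce exactly this ClassCastException (purely illustrative, not Spark or Iceberg code; the jar path is a placeholder):

{code:java}
// The same class name loaded by two unrelated class loaders yields two distinct
// Class objects, so a cast between them fails even though the fully-qualified
// name is identical.
import java.net.{URL, URLClassLoader}

val jarUrl = new URL("file:///tmp/iceberg-spark-runtime-3.5_2.12-1.4.3.jar") // placeholder path
val loaderA = new URLClassLoader(Array(jarUrl), null)
val loaderB = new URLClassLoader(Array(jarUrl), null)

val c1 = loaderA.loadClass("org.apache.iceberg.Table")
val c2 = loaderB.loadClass("org.apache.iceberg.Table")
println(c1 == c2)                // false: different defining loaders
println(c2.isAssignableFrom(c1)) // false, so casting an instance of c1 to c2 throws ClassCastException
{code}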

 

Again this issue doesn't happen with spark-connect 3.4 and doesn't happen with 
directly using spark3.5 without spark-connect 3.5

  was:
*Affected version:*

spark 3.5 and spark-connect_2.12:3.5.0

 

*Not affected version and variation:*

Spark 3.4 and spark-connect_2.12:3.4.0

Also works with just Spark 3.5 spark-submit script directly (ie without using 
spark-connect 3.5)

 

We are having following `java.lang.ClassCastException` error in spark Executors 
when using spark-connect 3.5 with external spark sql catalog jar - 
iceberg-spark-runtime-3.5_2.12-1.4.3.jar

 

 
{code:java}
pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
(org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
3) (spark35-m.c.strivr-dev-test.internal executor 2): 
java.lang.ClassCastException: class 
org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to 
class org.apache.iceberg.Table 
(org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed module 
of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; 
org.apache.iceberg.Table is in unnamed module of loader 
org.apache.spark.util.ChildFirstURLClassLoader @4b18b943)
    at 
org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88)
    at 
org.apache.iceberg.spark.source.RowDataReader.<init>(RowDataReader.java:50)
    at 

[jira] [Created] (SPARK-46762) Spark Connect 3.5 Classloading issue

2024-01-18 Thread nirav patel (Jira)
nirav patel created SPARK-46762:
---

 Summary: Spark Connect 3.5 Classloading issue
 Key: SPARK-46762
 URL: https://issues.apache.org/jira/browse/SPARK-46762
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0
Reporter: nirav patel


*Affected version:*

spark 3.5 and spark-connect_2.12:3.5.0

 

*Not affected version and variation:*

Spark 3.4 and spark-connect_2.12:3.4.0

Also works with just Spark 3.5 spark-submit script directly (ie without using 
spark-connect 3.5)

 

We are getting the following `java.lang.ClassCastException` error in Spark 
executors when using spark-connect 3.5 with an external Spark SQL catalog jar - 
iceberg-spark-runtime-3.5_2.12-1.4.3.jar

 

 
{code:java}
pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
(org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
3) (spark35-m.c.strivr-dev-test.internal executor 2): 
java.lang.ClassCastException: class 
org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to 
class org.apache.iceberg.Table 
(org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed module 
of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; 
org.apache.iceberg.Table is in unnamed module of loader 
org.apache.spark.util.ChildFirstURLClassLoader @4b18b943)
    at 
org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88)
    at 
org.apache.iceberg.spark.source.RowDataReader.<init>(RowDataReader.java:50)
    at 
org.apache.iceberg.spark.source.SparkRowReaderFactory.createReader(SparkRowReaderFactory.java:45)
    at 
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
    at 
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
    at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apach...{code}
 

 

We verified that there's only one jar of 
`iceberg-spark-runtime-3.5_2.12-1.4.3.jar` loaded when spark-connect server is 
started. 

Issue has been open with Iceberg as well: 
[https://github.com/apache/iceberg/issues/8978]

And being discussed in mail archive: 
[https://lists.apache.org/thread/5q1pdqqrd1h06hgs8vx9ztt60z5yv8n1]

 

Looking more into this issue it seems classloader itself is instantiated 
multiple times somewhere. I can see two instances: 
org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46762) Spark Connect 3.5 Classloading issue

2024-01-18 Thread nirav patel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nirav patel updated SPARK-46762:

Description: 
*Affected version:*

spark 3.5 and spark-connect_2.12:3.5.0

 

*Not affected version and variation:*

Spark 3.4 and spark-connect_2.12:3.4.0

Also works with just Spark 3.5 spark-submit script directly (ie without using 
spark-connect 3.5)

 

We are getting the following `java.lang.ClassCastException` error in Spark 
executors when using spark-connect 3.5 with an external Spark SQL catalog jar - 
iceberg-spark-runtime-3.5_2.12-1.4.3.jar

 

 
{code:java}
pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
(org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
3) (spark35-m.c.strivr-dev-test.internal executor 2): 
java.lang.ClassCastException: class 
org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to 
class org.apache.iceberg.Table 
(org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed module 
of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; 
org.apache.iceberg.Table is in unnamed module of loader 
org.apache.spark.util.ChildFirstURLClassLoader @4b18b943)
    at 
org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88)
    at 
org.apache.iceberg.spark.source.RowDataReader.<init>(RowDataReader.java:50)
    at 
org.apache.iceberg.spark.source.SparkRowReaderFactory.createReader(SparkRowReaderFactory.java:45)
    at 
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
    at 
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
    at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apach...{code}
 

 

We verified that there's only one jar of 
`iceberg-spark-runtime-3.5_2.12-1.4.3.jar` loaded when spark-connect server is 
started. 

Issue has been open with Iceberg as well: 
[https://github.com/apache/iceberg/issues/8978]

And being discussed in mail archive: 
[https://lists.apache.org/thread/5q1pdqqrd1h06hgs8vx9ztt60z5yv8n1]

 

Looking more into this issue, it seems the classloader itself is instantiated 
multiple times somewhere. I can see two instances: 
org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053 and 
org.apache.spark.util.ChildFirstURLClassLoader @4b18b943

  was:
*Affected version:*

spark 3.5 and spark-connect_2.12:3.5.0

 

*Not affected version and variation:*

Spark 3.4 and spark-connect_2.12:3.4.0

Also works with just Spark 3.5 spark-submit script directly (ie without using 
spark-connect 3.5)

 

We are having following `java.lang.ClassCastException` error in spark Executors 
when using spark-connect 3.5 with external spark sql catalog jar - 
iceberg-spark-runtime-3.5_2.12-1.4.3.jar

 

 
{code:java}
pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
(org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
3) (spark35-m.c.strivr-dev-test.internal executor 2): 
java.lang.ClassCastException: class 
org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to 
class org.apache.iceberg.Table 
(org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed module 
of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; 
org.apache.iceberg.Table is in unnamed module of loader 
org.apache.spark.util.ChildFirstURLClassLoader @4b18b943)
    at 
org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88)
    at 
org.apache.iceberg.spark.source.RowDataReader.<init>(RowDataReader.java:50)
    at 
org.apache.iceberg.spark.source.SparkRowReaderFactory.createReader(SparkRowReaderFactory.java:45)
    at 

[jira] [Created] (SPARK-46761) quoted strings in a JSON path should support ? characters

2024-01-18 Thread Robert Joseph Evans (Jira)
Robert Joseph Evans created SPARK-46761:
---

 Summary: quoted strings in a JSON path should support ? characters
 Key: SPARK-46761
 URL: https://issues.apache.org/jira/browse/SPARK-46761
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 4.0.0
Reporter: Robert Joseph Evans


I think this impacts all versions of Spark after SPARK-18677, which made the 
operator work at all in 2.1.0/2.0.3

It comes down to
{code:java}
 name <- '.' ~> "[^\\.\\[]+".r | "['" ~> "[^\\'\\?]+".r <~ "']"{code}
[https://github.com/apache/spark/blob/01bb1b1a3dbfc68f41d9b13de863d26d587c7e2f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L79]

 

The regular expression/pattern says that we want a [' followed by one or more 
characters that are not a single quote ' or a question mark ?, followed by ']. 
That question mark looks out of place. When I try to put a question mark in a 
quoted string, it fails to produce any result, but when I put the same 
data/path into [https://jsonpath.com/] I get a result:

 

data
{code:java}
{"?":"QUESTION"} {code}
path
{code:java}
$['?'] {code}
 

I also see no tests validating that a question mark is not allowed, so I suspect 
that it is a long-standing bug.
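A minimal reproduction sketch (spark-shell, where `spark` and its implicits are available; the NULL result is what this report describes, not verified output):

{code:java}
// get_json_object with a quoted '?' key: per this report, the path parser
// rejects the '?', so the expression returns NULL instead of "QUESTION".
val df = Seq("""{"?":"QUESTION"}""").toDF("json")
df.selectExpr("""get_json_object(json, "$['?']") AS result""").show()
// expected per this report: result = NULL
// [https://jsonpath.com/] returns "QUESTION" for the same data and path.
{code}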



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46759) Codec xz and zstandard support compression level for avro files

2024-01-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46759.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44786
[https://github.com/apache/spark/pull/44786]

> Codec xz and zstandard support compression level for avro files
> ---
>
> Key: SPARK-46759
> URL: https://issues.apache.org/jira/browse/SPARK-46759
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46759) Codec xz and zstandard support compression level for avro files

2024-01-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46759:
-

Assignee: Kent Yao

> Codec xz and zstandard support compression level for avro files
> ---
>
> Key: SPARK-46759
> URL: https://issues.apache.org/jira/browse/SPARK-46759
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46247) Invalid bucket file error when reading from bucketed table created with PathOutputCommitProtocol

2024-01-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-46247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808235#comment-17808235
 ] 

Никита Соколов commented on SPARK-46247:


No, there is no trailing dot at the end of the filenames; it comes from the 
exception message. The file is considered invalid because of the 
-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet suffix: BucketingUtils fails 
to extract the bucket id when that suffix is present.
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BucketingUtils.scala#L34C31-L34C31]
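To illustrate the failure mode, here is a simplified, purely hypothetical stand-in for the pattern linked above (not the actual BucketingUtils code):

{code:java}
// A bucketed file name is expected to end with "_<bucketId>.<extension>".
// The extra "-<uuid>" suffix written under PathOutputCommitProtocol sits between
// the bucket id and the extension, so the bucket id can no longer be extracted.
val simplifiedBucketPattern = """.*_(\d+)\..*""".r

def bucketIdOf(name: String): Option[Int] = name match {
  case simplifiedBucketPattern(id) => Some(id.toInt)
  case _                           => None
}

bucketIdOf("part-00117-43293810_00117.c000.parquet")          // Some(117)
bucketIdOf("part-00117-43293810_00117-5eb66a54.c000.parquet") // None -> [INVALID_BUCKET_FILE]
{code}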

Is this enough? If not, I will come back with the whole stack trace a bit 
later.

Should I use the s3a:// prefix in the path option or in some configuration?

 

> Invalid bucket file error when reading from bucketed table created with 
> PathOutputCommitProtocol
> 
>
> Key: SPARK-46247
> URL: https://issues.apache.org/jira/browse/SPARK-46247
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Никита Соколов
>Priority: Major
>
> I am trying to create an external partitioned bucketed table using this code:
> {code:java}
> spark.read.parquet("s3://faucct/input")
>   .repartition(128, col("product_id"))
>   .write.partitionBy("features_date").bucketBy(128, "product_id")
>   .option("path", "s3://faucct/tmp/output")
>   .option("compression", "uncompressed")
>   .saveAsTable("tmp.output"){code}
> At first it took more time than expected because it had to rename a lot of 
> files in the end, which requires copying in S3. But I have used the 
> configuration from the documentation – 
> [https://spark.apache.org/docs/3.0.0-preview/cloud-integration.html#committing-work-into-cloud-storage-safely-and-fast]:
> {code:java}
> spark.hadoop.fs.s3a.committer.name directory
> spark.sql.sources.commitProtocolClass 
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
> spark.sql.parquet.output.committer.class 
> org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter {code}
> It is properly partitioned: every partition_date has exactly 128 files named 
> like 
> [part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet|https://s3.console.aws.amazon.com/s3/object/joom-analytics-recom?region=eu-central-1=recom/dataset/best/best-to-cart-rt/user-product-v4/to_cart-faucct/fnw/ipw/msv2/2023-09-15/14d/tmp_3/features_date%3D2023-09-01/part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet].
> Then I am trying to join this table with another one, for example like this:
> {code:java}
> spark.table("tmp.output").repartition(128, $"product_id")
>   .join(spark.table("tmp.output").repartition(128, $"product_id"), 
> Seq("product_id")).count(){code}
> Because of the configuration I get the following errors:
> {code:java}
> org.apache.spark.SparkException: [INVALID_BUCKET_FILE] Invalid bucket file: 
> s3://faucct/tmp/output/features_date=2023-09-01/part-0-43293810-d0e9-4eee-9be8-e9e50a3e10fd_0-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.invalidBucketFile(QueryExecutionErrors.scala:2731)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$createBucketedReadRDD$5(DataSourceScanExec.scala:636)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46247) Invalid bucket file error when reading from bucketed table created with PathOutputCommitProtocol

2024-01-18 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808227#comment-17808227
 ] 

Steve Loughran commented on SPARK-46247:


Why is the file invalid? Any more stack trace?

# Try using s3a:// as the prefix all the way through.
# Is there really a "." at the end of the filenames?

The directory committer was Netflix's design for incremental update of an 
existing table, where a partition could be deleted before new data was 
committed.

Unless you want to do this, use the magic or (second best) the staging committer.
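For reference, a hedged sketch of that suggestion (the same settings quoted in the description below, with only the committer name changed to the magic committer; not verified against this workload):

{code:java}
spark.hadoop.fs.s3a.committer.name magic
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
{code}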


> Invalid bucket file error when reading from bucketed table created with 
> PathOutputCommitProtocol
> 
>
> Key: SPARK-46247
> URL: https://issues.apache.org/jira/browse/SPARK-46247
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Никита Соколов
>Priority: Major
>
> I am trying to create an external partitioned bucketed table using this code:
> {code:java}
> spark.read.parquet("s3://faucct/input")
>   .repartition(128, col("product_id"))
>   .write.partitionBy("features_date").bucketBy(128, "product_id")
>   .option("path", "s3://faucct/tmp/output")
>   .option("compression", "uncompressed")
>   .saveAsTable("tmp.output"){code}
> At first it took more time than expected because it had to rename a lot of 
> files in the end, which requires copying in S3. But I have used the 
> configuration from the documentation – 
> [https://spark.apache.org/docs/3.0.0-preview/cloud-integration.html#committing-work-into-cloud-storage-safely-and-fast]:
> {code:java}
> spark.hadoop.fs.s3a.committer.name directory
> spark.sql.sources.commitProtocolClass 
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
> spark.sql.parquet.output.committer.class 
> org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter {code}
> It is properly partitioned: every partition_date has exactly 128 files named 
> like 
> [part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet|https://s3.console.aws.amazon.com/s3/object/joom-analytics-recom?region=eu-central-1=recom/dataset/best/best-to-cart-rt/user-product-v4/to_cart-faucct/fnw/ipw/msv2/2023-09-15/14d/tmp_3/features_date%3D2023-09-01/part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet].
> Then I am trying to join this table with another one, for example like this:
> {code:java}
> spark.table("tmp.output").repartition(128, $"product_id")
>   .join(spark.table("tmp.output").repartition(128, $"product_id"), 
> Seq("product_id")).count(){code}
> Because of the configuration I get the following errors:
> {code:java}
> org.apache.spark.SparkException: [INVALID_BUCKET_FILE] Invalid bucket file: 
> s3://faucct/tmp/output/features_date=2023-09-01/part-0-43293810-d0e9-4eee-9be8-e9e50a3e10fd_0-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.invalidBucketFile(QueryExecutionErrors.scala:2731)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$createBucketedReadRDD$5(DataSourceScanExec.scala:636)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46759) Codec xz and zstandard support compression level for avro files

2024-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46759:
---
Labels: pull-request-available  (was: )

> Codec xz and zstandard support compression level for avro files
> ---
>
> Key: SPARK-46759
> URL: https://issues.apache.org/jira/browse/SPARK-46759
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46760) Make the document of spark.sql.adaptive.coalescePartitions.parallelismFirst clearer

2024-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46760:
---
Labels: pull-request-available  (was: )

> Make the document of spark.sql.adaptive.coalescePartitions.parallelismFirst 
> clearer
> ---
>
> Key: SPARK-46760
> URL: https://issues.apache.org/jira/browse/SPARK-46760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46760) Make the document of spark.sql.adaptive.coalescePartitions.parallelismFirst clearer

2024-01-18 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46760:
--

 Summary: Make the document of 
spark.sql.adaptive.coalescePartitions.parallelismFirst clearer
 Key: SPARK-46760
 URL: https://issues.apache.org/jira/browse/SPARK-46760
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46759) Codec xz and zstandard support compression level for avro files

2024-01-18 Thread Kent Yao (Jira)
Kent Yao created SPARK-46759:


 Summary: Codec xz and zstandard support compression level for avro 
files
 Key: SPARK-46759
 URL: https://issues.apache.org/jira/browse/SPARK-46759
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39910) DataFrameReader API cannot read files from hadoop archives (.har)

2024-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-39910:
--

Assignee: Apache Spark

> DataFrameReader API cannot read files from hadoop archives (.har)
> -
>
> Key: SPARK-39910
> URL: https://issues.apache.org/jira/browse/SPARK-39910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
>Reporter: Christophe Préaud
>Assignee: Apache Spark
>Priority: Minor
>  Labels: DataFrameReader, pull-request-available
>
> Reading a file from a Hadoop archive using the DataFrameReader API returns 
> an empty Dataset:
> {code:java}
> scala> val df = 
> spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
> df: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> df.count
> res7: Long = 0 {code}
>  
> On the other hand, reading the same file, from the same hadoop archive, but 
> using the RDD API yields the correct result:
> {code:java}
> scala> val df = 
> sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
> df: org.apache.spark.sql.DataFrame = [value: string]
> scala> df.count
> res8: Long = 5589 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39910) DataFrameReader API cannot read files from hadoop archives (.har)

2024-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-39910:
--

Assignee: (was: Apache Spark)

> DataFrameReader API cannot read files from hadoop archives (.har)
> -
>
> Key: SPARK-39910
> URL: https://issues.apache.org/jira/browse/SPARK-39910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
>Reporter: Christophe Préaud
>Priority: Minor
>  Labels: DataFrameReader, pull-request-available
>
> Reading a file from a Hadoop archive using the DataFrameReader API returns 
> an empty Dataset:
> {code:java}
> scala> val df = 
> spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
> df: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> df.count
> res7: Long = 0 {code}
>  
> On the other hand, reading the same file, from the same hadoop archive, but 
> using the RDD API yields the correct result:
> {code:java}
> scala> val df = 
> sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
> df: org.apache.spark.sql.DataFrame = [value: string]
> scala> df.count
> res8: Long = 5589 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46623) Replace SimpleDateFormat with DateTimeFormatter

2024-01-18 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808080#comment-17808080
 ] 

Mridul Muralidharan commented on SPARK-46623:
-

Issue resolved by pull request 44616
https://github.com/apache/spark/pull/44616

> Replace SimpleDateFormat with DateTimeFormatter
> ---
>
> Key: SPARK-46623
> URL: https://issues.apache.org/jira/browse/SPARK-46623
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46623) Replace SimpleDateFormat with DateTimeFormatter

2024-01-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46623.
-
Fix Version/s: 4.0.0
 Assignee: Jiaan Geng
   Resolution: Fixed

> Replace SimpleDateFormat with DateTimeFormatter
> ---
>
> Key: SPARK-46623
> URL: https://issues.apache.org/jira/browse/SPARK-46623
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46696) In ResourceProfileManager, function calls should occur after variable declarations.

2024-01-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46696.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44705
[https://github.com/apache/spark/pull/44705]

> In ResourceProfileManager, function calls should occur after variable 
> declarations.
> ---
>
> Key: SPARK-46696
> URL: https://issues.apache.org/jira/browse/SPARK-46696
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: liangyongyuan
>Assignee: liangyongyuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> As the title suggests, in *ResourceProfileManager*, function calls should be 
> made after variable declarations. When determining *isSupport*, all variables 
> are uninitialized, with booleans defaulting to false and objects to null. 
> While the end result is correct, the evaluation process is abnormal.
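A minimal sketch of the initialization-order behavior described above (a hypothetical class, not the actual ResourceProfileManager code):

{code:java}
// Vals in a Scala class body are initialized in declaration order, so a method
// invoked before the vals below are initialized sees their defaults: false for
// Boolean and null for objects -- the situation described in this ticket.
class Example {
  def isSupported: Boolean = dynamicEnabled && master != null

  val checkedTooEarly: Boolean = isSupported // runs before the vals below are set
  val dynamicEnabled: Boolean = true
  val master: String = "yarn"
}

val e = new Example
println(e.checkedTooEarly) // false: saw dynamicEnabled = false and master = null
println(e.isSupported)     // true once construction has finished
{code}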



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46696) In ResourceProfileManager, function calls should occur after variable declarations.

2024-01-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46696:
---

Assignee: liangyongyuan

> In ResourceProfileManager, function calls should occur after variable 
> declarations.
> ---
>
> Key: SPARK-46696
> URL: https://issues.apache.org/jira/browse/SPARK-46696
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: liangyongyuan
>Assignee: liangyongyuan
>Priority: Major
>  Labels: pull-request-available
>
> As the title suggests, in *ResourceProfileManager*, function calls should be 
> made after variable declarations. When determining *isSupport*, all variables 
> are uninitialized, with booleans defaulting to false and objects to null. 
> While the end result is correct, the evaluation process is abnormal.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46754) Fix compression code resolution in avro table definition

2024-01-18 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-46754.
--
Fix Version/s: 4.0.0
 Assignee: Kent Yao
   Resolution: Fixed

resolved by  [GitHub Pull Request 
#44780|https://github.com/apache/spark/pull/44780]

> Fix compression code resolution in avro table definition
> 
>
> Key: SPARK-46754
> URL: https://issues.apache.org/jira/browse/SPARK-46754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> One fix is for case insensitivity and the other for correctly handling invalid 
> codec names.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46708) Support error message format in Spark Connect service

2024-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46708:
---
Labels: pull-request-available  (was: )

> Support error message format in Spark Connect service
> -
>
> Key: SPARK-46708
> URL: https://issues.apache.org/jira/browse/SPARK-46708
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Garland Zhang
>Priority: Major
>  Labels: pull-request-available
>
> * Spark Connect does not properly support {{spark.sql.error.messageFormat}}, 
> which means Spark Connect exception messages don't change based on the 
> configured format.
>  * We need to add this parity to Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org