[jira] [Created] (SPARK-47621) Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean`

2024-03-27 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-47621:


 Summary: Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, 
`mean`
 Key: SPARK-47621
 URL: https://issues.apache.org/jira/browse/SPARK-47621
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47363) Initial State without state reader implementation for State API v2.

2024-03-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47363.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45467
[https://github.com/apache/spark/pull/45467]

> Initial State without state reader implementation for State API v2.
> ---
>
> Key: SPARK-47363
> URL: https://issues.apache.org/jira/browse/SPARK-47363
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Jing Zhan
>Assignee: Jing Zhan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This PR adds support for users to provide a Dataframe that can be used to 
> instantiate state for the query in the first batch for arbitrary state API v2.
> Note that populating the initial state will only happen for the first batch 
> of the new streaming query. Trying to re-initialize state for the same 
> grouping key will result in an error.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47287) Aggregate in not causes

2024-03-27 Thread Juefei Yan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831625#comment-17831625
 ] 

Juefei Yan commented on SPARK-47287:


I tried the code on the 3.4 branch and cannot reproduce this problem.

> Aggregate in not causes 
> 
>
> Key: SPARK-47287
> URL: https://issues.apache.org/jira/browse/SPARK-47287
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Ted Chester Jenks
>Priority: Major
>
>  
> The snippet below is confirmed working with Spark 3.2.1 and broken with Spark 
> 3.4.1. I believe this is a bug. 
> {code:java}
>Dataset<Row> ds = dummyDataset
> .withColumn("flag", 
> functions.not(functions.coalesce(functions.col("bool1"), 
> functions.lit(false)).equalTo(true)))
> .groupBy("code")
> .agg(functions.max(functions.col("flag")).alias("flag"));
> ds.show(); {code}
> It fails with:
> {code:java}
> Caused by: java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:208)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.$anonfun$generateExpression$7(V2ExpressionBuilder.scala:185)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:184)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
>   at 
> org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateAggregateFunc(V2ExpressionBuilder.scala:293)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:98)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
>   at 
> org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.translate$1(DataSourceStrategy.scala:700){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47619) Refine docstring of `to_json/from_json`

2024-03-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47619.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45742
[https://github.com/apache/spark/pull/45742]

> Refine docstring of `to_json/from_json`
> ---
>
> Key: SPARK-47619
> URL: https://issues.apache.org/jira/browse/SPARK-47619
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44050) Unable to Mount ConfigMap in Driver Pod - ConfigMap Creation Issue

2024-03-27 Thread liangyouze (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831611#comment-17831611
 ] 

liangyouze commented on SPARK-44050:


I've encountered the same issue. I'd like to create a PR to try to fix it.

> Unable to Mount ConfigMap in Driver Pod - ConfigMap Creation Issue
> --
>
> Key: SPARK-44050
> URL: https://issues.apache.org/jira/browse/SPARK-44050
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.3.1
>Reporter: Harshwardhan Singh Dodiya
>Priority: Critical
> Attachments: image-2023-06-14-11-07-36-960.png
>
>
> Dear Spark community,
> I am facing an issue related to mounting a ConfigMap in the driver pod of my 
> Spark application. Upon investigation, I realized that the problem is caused 
> by the ConfigMap not being created successfully.
> *Problem Description:*
> When attempting to mount the ConfigMap in the driver pod, I encounter 
> consistent failures and my pod stays in containerCreating state. Upon further 
> investigation, I discovered that the ConfigMap does not exist in the 
> Kubernetes cluster, which results in the driver pod's inability to access the 
> required configuration data.
> *Additional Information:*
> I would like to highlight that this issue is not a frequent occurrence. It 
> has been observed randomly, affecting the mounting of the ConfigMap in the 
> driver pod only approximately 5% of the time. This intermittent behavior adds 
> complexity to the troubleshooting process, as it is challenging to reproduce 
> consistently.
> *Error Message:*
> When describing the driver pod (kubectl describe pod pod_name), we get the 
> error below:
> "ConfigMap '' not found."
> *To Reproduce:*
> 1. Download Spark 3.3.1 from [https://spark.apache.org/downloads.html]
> 2. Create an image with "bin/docker-image-tool.sh"
> 3. Submit on the Spark client via a bash command, passing all the details and 
> configurations.
> 4. Randomly, in some of the driver pods, we can observe this issue.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47619) Refine docstring of `to_json/from_json`

2024-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47619:
---
Labels: pull-request-available  (was: )

> Refine docstring of `to_json/from_json`
> ---
>
> Key: SPARK-47619
> URL: https://issues.apache.org/jira/browse/SPARK-47619
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47619) Refine docstring of `to_json/from_json`

2024-03-27 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-47619:


 Summary: Refine docstring of `to_json/from_json`
 Key: SPARK-47619
 URL: https://issues.apache.org/jira/browse/SPARK-47619
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47543) Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation.

2024-03-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47543.
--
  Assignee: Haejoon Lee
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/45699

> Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame 
> creation.
> 
>
> Key: SPARK-47543
> URL: https://issues.apache.org/jira/browse/SPARK-47543
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Currently, PyArrow infers the Pandas dictionary field as StructType 
> instead of MapType, so Spark can't handle the schema properly:
> {code:python}
> >>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{'first': 0.7, 
> >>> 'second': 0.3}]})
> >>> pa.Schema.from_pandas(pdf)
> str_col: string
> dict_col: struct<first: double, second: double>
>   child 0, first: double
>   child 1, second: double
> {code}
> We cannot handle this case since we use PyArrow for schema creation.
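
For illustration, a minimal reproduction sketch of what the ticket is about (assuming an active SparkSession with Arrow available; the map inference mentioned in the comment is the expected post-fix behavior, not necessarily the current one):

{code:python}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({"str_col": ["second"], "dict_col": [{"first": 0.7, "second": 0.3}]})

# PyArrow-based inference currently treats dict_col as a struct; per this ticket,
# it should instead be possible to create the DataFrame with dict_col inferred
# as map<string,double>.
df = spark.createDataFrame(pdf)
df.printSchema()
{code}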



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47543) Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation.

2024-03-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-47543:
-
Fix Version/s: 4.0.0

> Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame 
> creation.
> 
>
> Key: SPARK-47543
> URL: https://issues.apache.org/jira/browse/SPARK-47543
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, PyArrow infers the Pandas dictionary field as StructType 
> instead of MapType, so Spark can't handle the schema properly:
> {code:python}
> >>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{'first': 0.7, 
> >>> 'second': 0.3}]})
> >>> pa.Schema.from_pandas(pdf)
> str_col: string
> dict_col: struct<first: double, second: double>
>   child 0, first: double
>   child 1, second: double
> {code}
> We cannot handle this case since we use PyArrow for schema creation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47107) Implement partition reader for python streaming data source

2024-03-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47107.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45485
[https://github.com/apache/spark/pull/45485]

> Implement partition reader for python streaming data source
> ---
>
> Key: SPARK-47107
> URL: https://issues.apache.org/jira/browse/SPARK-47107
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Piggyback on the PythonPartitionReaderFactory to implement reading a data 
> partition for the Python streaming data source. Add a test case to verify that 
> the Python streaming data source can read and process data end to end.
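
The JVM-side PythonPartitionReaderFactory is internal, but for orientation, here is a rough sketch of the user-facing Python side it serves (class and method names follow my reading of the pyspark.sql.datasource streaming API and should be treated as assumptions, not the exact interface):

{code:python}
from pyspark.sql.datasource import DataSourceStreamReader, InputPartition


class RangePartition(InputPartition):
    def __init__(self, start, end):
        self.start, self.end = start, end


class CounterStreamReader(DataSourceStreamReader):
    """Toy stream reader emitting 10 rows per microbatch."""

    def initialOffset(self):
        return {"offset": 0}

    def latestOffset(self):
        self.current = getattr(self, "current", 0) + 10
        return {"offset": self.current}

    def partitions(self, start, end):
        # One partition covering this microbatch's offset range; each partition
        # becomes a task handled by the partition reader on an executor.
        return [RangePartition(start["offset"], end["offset"])]

    def read(self, partition):
        # Runs on executors: yield tuples matching the source's schema.
        for i in range(partition.start, partition.end):
            yield (i,)
{code}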



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47107) Implement partition reader for python streaming data source

2024-03-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-47107:


Assignee: Chaoqin Li

> Implement partition reader for python streaming data source
> ---
>
> Key: SPARK-47107
> URL: https://issues.apache.org/jira/browse/SPARK-47107
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
>
> Piggyback on the PythonPartitionReaderFactory to implement reading a data 
> partition for the Python streaming data source. Add a test case to verify that 
> the Python streaming data source can read and process data end to end.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47618) Use Magic Committer for all S3 buckets by default

2024-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47618:
--
Description: 
This issue aims to use Apache Hadoop `Magic Committer` for all S3 buckets by 
default in Apache Spark 4.0.0.

Apache Hadoop `Magic Committer` has been used for S3 buckets to get the best 
performance since [S3 became fully consistent on December 1st, 
2020|https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/].
- 
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel
bq. Amazon S3 provides strong read-after-write consistency for PUT and DELETE 
requests of objects in your Amazon S3 bucket in all AWS Regions. This behavior 
applies to both writes to new objects as well as PUT requests that overwrite 
existing objects and DELETE requests. In addition, read operations on Amazon S3 
Select, Amazon S3 access controls lists (ACLs), Amazon S3 Object Tags, and 
object metadata (for example, the HEAD object) are strongly consistent.
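
For context, a hedged sketch of opting into the magic committer explicitly today (these are the cloud-committer settings documented for Spark and require the spark-hadoop-cloud module on the classpath; this ticket proposes making an equivalent setup the default for S3):

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Hadoop S3A magic committer
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    # Route Spark's commit protocol through the Hadoop PathOutputCommitter
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)
{code}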


> Use Magic Committer for all S3 buckets by default
> -
>
> Key: SPARK-47618
> URL: https://issues.apache.org/jira/browse/SPARK-47618
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> This issue aims to use Apache Hadoop `Magic Committer` for all S3 buckets by 
> default in Apache Spark 4.0.0.
> Apache Hadoop `Magic Committer` has been used for S3 buckets to get the best 
> performance since [S3 became fully consistent on December 1st, 
> 2020|https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/].
> - 
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel
> bq. Amazon S3 provides strong read-after-write consistency for PUT and DELETE 
> requests of objects in your Amazon S3 bucket in all AWS Regions. This 
> behavior applies to both writes to new objects as well as PUT requests that 
> overwrite existing objects and DELETE requests. In addition, read operations 
> on Amazon S3 Select, Amazon S3 access controls lists (ACLs), Amazon S3 Object 
> Tags, and object metadata (for example, the HEAD object) are strongly 
> consistent.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47618) Use Magic Committer for all S3 buckets by default

2024-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47618:
-

Assignee: Dongjoon Hyun

> Use Magic Committer for all S3 buckets by default
> -
>
> Key: SPARK-47618
> URL: https://issues.apache.org/jira/browse/SPARK-47618
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47618) Use Magic Committer for all S3 buckets by default

2024-03-27 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47618:
-

 Summary: Use Magic Committer for all S3 buckets by default
 Key: SPARK-47618
 URL: https://issues.apache.org/jira/browse/SPARK-47618
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47618) Use Magic Committer for all S3 buckets by default

2024-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47618:
---
Labels: pull-request-available  (was: )

> Use Magic Committer for all S3 buckets by default
> -
>
> Key: SPARK-47618
> URL: https://issues.apache.org/jira/browse/SPARK-47618
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47609) CacheManager Lookup can miss picking InMemoryRelation corresponding to subplan

2024-03-27 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-47609:
-
Description: 
This issue became apparent while bringing my PR 
[https://github.com/apache/spark/pull/43854]

in synch with latest master.

Though that PR is meant to do early collapse of projects so that the tree size 
is kept at minimum when projects keep getting added , in the analyzer phase 
itself.

But as part of the work, the CacheManager lookup also needed to be modified.

One of the newly added test in master failed. On analysis of failure it turns 
out that the cache manager is not picking cached InMemoryRelation for a subplan.

This shows up in following existing test

org.apache.spark.sql.DatasetCacheSuite
{quote}test("SPARK-26708 Cache data and cached plan should stay consistent") {
val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)

df.cache()
// Add df1 to the CacheManager; the buffer is currently empty.
df1.cache()
{color:#4c9aff}// After calling collect(), df1's buffer has been loaded.{color}
df1.collect()
// Add df2 to the CacheManager; the buffer is currently empty.
df2.cache()

// Verify that df1 is a InMemoryRelation plan with dependency on another cached plan.
assertCacheDependency(df1)
val df1InnerPlan = df1.queryExecution.withCachedData
.asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
// Verify that df2 is a InMemoryRelation plan with dependency on another cached plan.
assertCacheDependency(df2)

df.unpersist(blocking = true)

{color:#00875a}// Verify that df1's cache has stayed the same, since df1's 
cache already has data{color}
// before df.unpersist().
val df1Limit = df1.limit(2)
val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
  case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}
assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

// Verify that df2's cache has been re-cached, with a new physical plan rid of 
dependency
// on df, since df2's cache had not been loaded before df.unpersist().
val df2Limit = df2.limit(2)
val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
  case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}
{quote}
{quote}*{color:#de350b}// This assertion is not right{color}*
assert(df2LimitInnerPlan.isDefined &&
!df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{quote}
 

Since df1 exists in the cache as an InMemoryRelation,

val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)

df2 is derivable from the cached df1.

So when val df2Limit = df2.limit(2) is created, it should utilize the cached df1.

The pull request for this is 
[https://github.com/apache/spark/pull/43854]

[jira] [Resolved] (SPARK-47485) Create column with collations in dataframe API

2024-03-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-47485.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45569
[https://github.com/apache/spark/pull/45569]

> Create column with collations in dataframe API
> --
>
> Key: SPARK-47485
> URL: https://issues.apache.org/jira/browse/SPARK-47485
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add the ability to create string columns with non-default collations in the 
> DataFrame API.
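
The exact DataFrame-side API added by the PR is not spelled out in this ticket; as a hedged illustration only, one way to attach a non-default collation to a column today is the SQL COLLATE expression through expr (the collation name UTF8_BINARY_LCASE is taken from related tickets in this thread and is an assumption):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Aa",), ("aA",)], ["s"])

# Illustrative workaround, not the new API introduced by this ticket.
collated = df.select(expr("s COLLATE UTF8_BINARY_LCASE").alias("s_lcase"))
collated.printSchema()
{code}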



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47617) Add TPC-DS testing infrastructure for collations

2024-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47617:
---
Labels: pull-request-available  (was: )

> Add TPC-DS testing infrastructure for collations
> 
>
> Key: SPARK-47617
> URL: https://issues.apache.org/jira/browse/SPARK-47617
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
>
> As collation support grows across all SQL features and new collation types 
> are added, we need a reliable testing model covering as many standard 
> SQL capabilities as possible.
> We can utilize the TPC-DS testing infrastructure already present in Spark. The 
> idea is to vary TPC-DS table string columns by adding multiple collations 
> with different ordering rules and case sensitivity, producing new tables. 
> These tables should yield the same results against predefined TPC-DS queries 
> for certain batches of collations. For example, when comparing query runs on 
> a table where the columns are first collated as UTF8_BINARY and then as 
> UTF8_BINARY_LCASE, we should get the same results after converting to 
> lowercase.
> Introduce a new query suite which tests the described behavior with the 
> available collations (utf8_binary and unicode) combined with case conversions 
> (lowercase, uppercase, randomized case for fuzzy testing).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47617) Add TPC-DS testing infrastructure for collations

2024-03-27 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47617:
--
Description: 
As collation support grows across all SQL features and new collation types are 
added, we need a reliable testing model covering as many standard SQL 
capabilities as possible.

We can utilize the TPC-DS testing infrastructure already present in Spark. The idea 
is to vary TPC-DS table string columns by adding multiple collations with 
different ordering rules and case sensitivity, producing new tables. These 
tables should yield the same results against predefined TPC-DS queries for 
certain batches of collations. For example, when comparing query runs on a table 
where the columns are first collated as UTF8_BINARY and then as UTF8_BINARY_LCASE, 
we should get the same results after converting to lowercase.

Introduce a new query suite which tests the described behavior with the available 
collations (utf8_binary and unicode) combined with case conversions (lowercase, 
uppercase, randomized case for fuzzy testing).

  was:
As collation support grows across all SQL features and new collation types are 
added, we need to have reliable testing model covering as many standard SQL 
capabilities as possible.

We can utilize TCP-DS testing infrastructure already present in Spark. The idea 
is to vary TCP-DS table string columns by adding multiple collations with 
different ordering rules and case sensitivity, producing new tables. These 
tables should yield the same results against predefined TCP-DS queries for 
certain batches of collations. For example, when comparing query runs on table 
where columns are first collated as UTF8_BINARY and then as UTF8_BINARY_LCASE, 
we should be getting same results after converting to lowercase.

Introduce new query suite which tests the described behavior with available 
collations (utf8_binary and unicode) combined with case conversions (lowercase, 
uppercase, randomized case for fuzzy testing).


> Add TPC-DS testing infrastructure for collations
> 
>
> Key: SPARK-47617
> URL: https://issues.apache.org/jira/browse/SPARK-47617
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>
> As collation support grows across all SQL features and new collation types 
> are added, we need a reliable testing model covering as many standard 
> SQL capabilities as possible.
> We can utilize the TPC-DS testing infrastructure already present in Spark. The 
> idea is to vary TPC-DS table string columns by adding multiple collations 
> with different ordering rules and case sensitivity, producing new tables. 
> These tables should yield the same results against predefined TPC-DS queries 
> for certain batches of collations. For example, when comparing query runs on 
> a table where the columns are first collated as UTF8_BINARY and then as 
> UTF8_BINARY_LCASE, we should get the same results after converting to 
> lowercase.
> Introduce a new query suite which tests the described behavior with the 
> available collations (utf8_binary and unicode) combined with case conversions 
> (lowercase, uppercase, randomized case for fuzzy testing).
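
As a toy illustration of the comparison idea (not the actual suite; the table and column names are hypothetical, and the lowercase-based check is a simplification of what the infrastructure would automate):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, lower

spark = SparkSession.builder.getOrCreate()

# Hypothetical catalogs holding the same TPC-DS data generated with two collations.
binary_tbl = spark.table("tpcds_utf8_binary.customer")
lcase_tbl = spark.table("tpcds_utf8_binary_lcase.customer")

# Group the binary-collated table on a lowercased key, and the case-insensitive
# table on the raw key; after lowercasing, both result sets should match.
binary_res = (binary_tbl
              .groupBy(lower("c_last_name").alias("k"))
              .agg(count("*").alias("cnt")))
lcase_res = (lcase_tbl
             .groupBy("c_last_name")
             .agg(count("*").alias("cnt"))
             # cast back to a default-collated string so the set operation
             # compares like with like
             .select(lower("c_last_name").cast("string").alias("k"), "cnt"))

assert binary_res.exceptAll(lcase_res).count() == 0
{code}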



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47617) Add TPC-DS testing infrastructure for collations

2024-03-27 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47617:
--
Summary: Add TPC-DS testing infrastructure for collations  (was: Add TCP-DS 
testing infrastructure for collations)

> Add TPC-DS testing infrastructure for collations
> 
>
> Key: SPARK-47617
> URL: https://issues.apache.org/jira/browse/SPARK-47617
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>
> As collation support grows across all SQL features and new collation types 
> are added, we need to have reliable testing model covering as many standard 
> SQL capabilities as possible.
> We can utilize TCP-DS testing infrastructure already present in Spark. The 
> idea is to vary TCP-DS table string columns by adding multiple collations 
> with different ordering rules and case sensitivity, producing new tables. 
> These tables should yield the same results against predefined TCP-DS queries 
> for certain batches of collations. For example, when comparing query runs on 
> table where columns are first collated as UTF8_BINARY and then as 
> UTF8_BINARY_LCASE, we should be getting same results after converting to 
> lowercase.
> Introduce new query suite which tests the described behavior with available 
> collations (utf8_binary and unicode) combined with case conversions 
> (lowercase, uppercase, randomized case for fuzzy testing).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47413) Substring, Right, Left (all collations)

2024-03-27 Thread David Milicevic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831421#comment-17831421
 ] 

David Milicevic commented on SPARK-47413:
-

Hey [~gpgp], let's wait for [~uros-db] to confirm; I'm new to the team, so I'm still 
getting familiar with things.

> Substring, Right, Left (all collations)
> ---
>
> Key: SPARK-47413
> URL: https://issues.apache.org/jira/browse/SPARK-47413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Substring* built-in string function in 
> Spark (including *Right* and *Left* functions). First confirm what is the 
> expected behaviour for these functions when given collated strings, then move 
> on to the implementation that would enable handling strings of all collation 
> types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the {*}Substring{*}, 
> {*}Right{*}, and *Left* functions so that they support all collation types 
> currently supported in Spark. To understand what changes were introduced in 
> order to enable full collation support for other existing functions in Spark, 
> take a look at the Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47617) Add TCP-DS testing infrastructure for collations

2024-03-27 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-47617:
-

 Summary: Add TCP-DS testing infrastructure for collations
 Key: SPARK-47617
 URL: https://issues.apache.org/jira/browse/SPARK-47617
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikola Mandic


As collation support grows across all SQL features and new collation types are 
added, we need to have reliable testing model covering as many standard SQL 
capabilities as possible.

We can utilize TCP-DS testing infrastructure already present in Spark. The idea 
is to vary TCP-DS table string columns by adding multiple collations with 
different ordering rules and case sensitivity, producing new tables. These 
tables should yield the same results against predefined TCP-DS queries for 
certain batches of collations. For example, when comparing query runs on table 
where columns are first collated as UTF8_BINARY and then as UTF8_BINARY_LCASE, 
we should be getting same results after converting to lowercase.

Introduce new query suite which tests the described behavior with available 
collations (utf8_binary and unicode) combined with case conversions (lowercase, 
uppercase, randomized case for fuzzy testing).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47413) Substring, Right, Left (all collations)

2024-03-27 Thread Gideon P (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831393#comment-17831393
 ] 

Gideon P commented on SPARK-47413:
--

[~davidm-db] please confirm the expected behavior I have outlined above!


> Substring, Right, Left (all collations)
> ---
>
> Key: SPARK-47413
> URL: https://issues.apache.org/jira/browse/SPARK-47413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Substring* built-in string function in 
> Spark (including *Right* and *Left* functions). First confirm what is the 
> expected behaviour for these functions when given collated strings, then move 
> on to the implementation that would enable handling strings of all collation 
> types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the {*}Substring{*}, 
> {*}Right{*}, and *Left* functions so that they support all collation types 
> currently supported in Spark. To understand what changes were introduced in 
> order to enable full collation support for other existing functions in Spark, 
> take a look at the Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47413) Substring, Right, Left (all collations)

2024-03-27 Thread Gideon P (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831392#comment-17831392
 ] 

Gideon P edited comment on SPARK-47413 at 3/27/24 2:53 PM:
---

>  First confirm what is the expected behaviour for these functions when given 
> collated strings,

h1. Collation Handling Summary

For the substr, substring, left, and right string manipulation functions in 
SQL, the explicit or implicit collation of the first parameter is preserved in 
the function's output. This means that if the input string has a specific 
collation (whether defined explicitly through a COLLATE expression or 
implicitly by its source, such as a column's collation), this collation is 
maintained in the resulting string produced by these functions. 

The behavior of these functions, apart from collation handling, remains 
consistent with their standard operation. This includes their handling of 
starting positions, lengths, and their ability to work with both positive and 
negative indices for defining substring boundaries.

While the `len` parameter (a number) can be a string, its implicit or explicit 
collation will be thrown away and will not affect the output.
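
To make the intended propagation concrete, a small sketch (it assumes the COLLATE expression syntax and the UTF8_BINARY_LCASE collation name used elsewhere in this thread, and that the changes from this ticket are in place):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# left() should return a string that keeps the UTF8_BINARY_LCASE collation of its
# first argument, so this comparison is expected to be case-insensitive (true).
spark.sql(
    "SELECT left('ABCDE' COLLATE UTF8_BINARY_LCASE, 3) = 'abc' AS case_insensitive_match"
).show()
{code}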

Note: unit tests should show that we achieved the following:
 Collation will be supported for:
-   STRING columns
-   STRING expressions
-   STRING fields in structs

h1. Session Collation
The third level of collation. Will get back to you about expectations with 
regards to these four functions. 


was (Author: JIRAUSER304403):
>  First confirm what is the expected behaviour for these functions when given 
> collated strings,

# Collation Handling Summary

For the substr, substring, left, and right string manipulation functions in 
SQL, the explicit or implicit collation of the first parameter is preserved in 
the function's output. This means that if the input string has a specific 
collation (whether defined explicitly through a COLLATE expression or 
implicitly by its source, such as a column's collation), this collation is 
maintained in the resulting string produced by these functions. 

The behavior of these functions, apart from collation handling, remains 
consistent with their standard operation. This includes their handling of 
starting positions, lengths, and their ability to work with both positive and 
negative indices for defining substring boundaries.

While `len` parameter (a number) can be a string, it's implicit or explicit 
collation will be throw away and not effect output.

Note: unit tests should show that we achieved the following:
 Collation will be supported for:
●   STRING columns
●   STRING expressions
●   STRING fields in structs

# Session Collation
The third level of collation. Will get back to you about expectations with 
regards to these four functions. 

> Substring, Right, Left (all collations)
> ---
>
> Key: SPARK-47413
> URL: https://issues.apache.org/jira/browse/SPARK-47413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Substring* built-in string function in 
> Spark (including *Right* and *Left* functions). First confirm what is the 
> expected behaviour for these functions when given collated strings, then move 
> on to the implementation that would enable handling strings of all collation 
> types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the {*}Substring{*}, 
> {*}Right{*}, and *Left* functions so that they support all collation types 
> currently supported in Spark. To understand what changes were introduced in 
> order to enable full collation support for other existing functions in Spark, 
> take a look at the Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To 

[jira] [Commented] (SPARK-47413) Substring, Right, Left (all collations)

2024-03-27 Thread Gideon P (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831392#comment-17831392
 ] 

Gideon P commented on SPARK-47413:
--

>  First confirm what is the expected behaviour for these functions when given 
> collated strings,

# Collation Handling Summary

For the substr, substring, left, and right string manipulation functions in 
SQL, the explicit or implicit collation of the first parameter is preserved in 
the function's output. This means that if the input string has a specific 
collation (whether defined explicitly through a COLLATE expression or 
implicitly by its source, such as a column's collation), this collation is 
maintained in the resulting string produced by these functions. 

The behavior of these functions, apart from collation handling, remains 
consistent with their standard operation. This includes their handling of 
starting positions, lengths, and their ability to work with both positive and 
negative indices for defining substring boundaries.

While the `len` parameter (a number) can be a string, its implicit or explicit 
collation will be thrown away and will not affect the output.

Note: unit tests should show that we achieved the following:
 Collation will be supported for:
●   STRING columns
●   STRING expressions
●   STRING fields in structs

# Session Collation
The third level of collation. Will get back to you about expectations with 
regards to these four functions. 

> Substring, Right, Left (all collations)
> ---
>
> Key: SPARK-47413
> URL: https://issues.apache.org/jira/browse/SPARK-47413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Substring* built-in string function in 
> Spark (including *Right* and *Left* functions). First confirm what is the 
> expected behaviour for these functions when given collated strings, then move 
> on to the implementation that would enable handling strings of all collation 
> types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the {*}Substring{*}, 
> {*}Right{*}, and *Left* functions so that they support all collation types 
> currently supported in Spark. To understand what changes were introduced in 
> order to enable full collation support for other existing functions in Spark, 
> take a look at the Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47413) Substring, Right, Left (all collations)

2024-03-27 Thread Gideon P (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831392#comment-17831392
 ] 

Gideon P edited comment on SPARK-47413 at 3/27/24 2:53 PM:
---

bq.  First confirm what is the expected behaviour for these functions when 
given collated strings,

h1. Collation Handling Summary

For the substr, substring, left, and right string manipulation functions in 
SQL, the explicit or implicit collation of the first parameter is preserved in 
the function's output. This means that if the input string has a specific 
collation (whether defined explicitly through a COLLATE expression or 
implicitly by its source, such as a column's collation), this collation is 
maintained in the resulting string produced by these functions. 

The behavior of these functions, apart from collation handling, remains 
consistent with their standard operation. This includes their handling of 
starting positions, lengths, and their ability to work with both positive and 
negative indices for defining substring boundaries.

While the `len` parameter (a number) can be a string, its implicit or explicit 
collation will be thrown away and will not affect the output.

Note: unit tests should show that we achieved the following:
 Collation will be supported for:
-   STRING columns
-   STRING expressions
-   STRING fields in structs

h1. Session Collation
The third level of collation. Will get back to you about expectations with 
regards to these four functions. 


was (Author: JIRAUSER304403):
>  First confirm what is the expected behaviour for these functions when given 
> collated strings,

h1. Collation Handling Summary

For the substr, substring, left, and right string manipulation functions in 
SQL, the explicit or implicit collation of the first parameter is preserved in 
the function's output. This means that if the input string has a specific 
collation (whether defined explicitly through a COLLATE expression or 
implicitly by its source, such as a column's collation), this collation is 
maintained in the resulting string produced by these functions. 

The behavior of these functions, apart from collation handling, remains 
consistent with their standard operation. This includes their handling of 
starting positions, lengths, and their ability to work with both positive and 
negative indices for defining substring boundaries.

While `len` parameter (a number) can be a string, it's implicit or explicit 
collation will be throw away and not effect output.

Note: unit tests should show that we achieved the following:
 Collation will be supported for:
-   STRING columns
-   STRING expressions
-   STRING fields in structs

h1. Session Collation
The third level of collation. Will get back to you about expectations with 
regards to these four functions. 

> Substring, Right, Left (all collations)
> ---
>
> Key: SPARK-47413
> URL: https://issues.apache.org/jira/browse/SPARK-47413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Substring* built-in string function in 
> Spark (including *Right* and *Left* functions). First confirm what is the 
> expected behaviour for these functions when given collated strings, then move 
> on to the implementation that would enable handling strings of all collation 
> types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the {*}Substring{*}, 
> {*}Right{*}, and *Left* functions so that they support all collation types 
> currently supported in Spark. To understand what changes were introduced in 
> order to enable full collation support for other existing functions in Spark, 
> take a look at the Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-

[jira] [Resolved] (SPARK-47616) Document Mapping Spark SQL Data Types from MySQL

2024-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47616.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45736
[https://github.com/apache/spark/pull/45736]

> Document Mapping Spark SQL Data Types from MySQL
> 
>
> Key: SPARK-47616
> URL: https://issues.apache.org/jira/browse/SPARK-47616
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47616) Document Mapping Spark SQL Data Types from MySQL

2024-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47616:
-

Assignee: Kent Yao

> Document Mapping Spark SQL Data Types from MySQL
> 
>
> Key: SPARK-47616
> URL: https://issues.apache.org/jira/browse/SPARK-47616
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47413) Substring, Right, Left (all collations)

2024-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47413:
---
Labels: pull-request-available  (was: )

> Substring, Right, Left (all collations)
> ---
>
> Key: SPARK-47413
> URL: https://issues.apache.org/jira/browse/SPARK-47413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Substring* built-in string function in 
> Spark (including *Right* and *Left* functions). First confirm what is the 
> expected behaviour for these functions when given collated strings, then move 
> on to the implementation that would enable handling strings of all collation 
> types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the {*}Substring{*}, 
> {*}Right{*}, and *Left* functions so that they support all collation types 
> currently supported in Spark. To understand what changes were introduced in 
> order to enable full collation support for other existing functions in Spark, 
> take a look at the Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47564) always throw FAILED_READ_FILE error when fail to read files

2024-03-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-47564:


Assignee: Wenchen Fan

> always throw FAILED_READ_FILE error when fail to read files
> ---
>
> Key: SPARK-47564
> URL: https://issues.apache.org/jira/browse/SPARK-47564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47564) always throw FAILED_READ_FILE error when fail to read files

2024-03-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-47564.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45723
[https://github.com/apache/spark/pull/45723]

> always throw FAILED_READ_FILE error when fail to read files
> ---
>
> Key: SPARK-47564
> URL: https://issues.apache.org/jira/browse/SPARK-47564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47563) Map normalization upon creation

2024-03-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47563.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45721
[https://github.com/apache/spark/pull/45721]

> Map normalization upon creation
> ---
>
> Key: SPARK-47563
> URL: https://issues.apache.org/jira/browse/SPARK-47563
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stevo Mitric
>Assignee: Stevo Mitric
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add handling of map normalization upon creation in ArrayBasedMapBuilder. 
> Currently a map with keys 0.0 and -0.0 will behave as if they are separate 
> values. This will cause issues when doing GROUP BY on map types.
> Refer to this conversation 
> [https://github.com/apache/spark/pull/45549#discussion_r1537803505]
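
A minimal sketch (not from this ticket) of the 0.0 / -0.0 key problem the normalization addresses; the behaviour noted in the comments is an assumption about how un-normalized maps look before the fix.

{code:python}
# Two maps whose single key differs only in the sign of floating-point zero.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1).select(
    F.create_map(F.lit(0.0), F.lit(1)).alias("m1"),
    F.create_map(F.lit(-0.0), F.lit(1)).alias("m2"),
)

# Without normalization upon creation, m1 and m2 carry physically different keys
# (0.0 vs -0.0) even though 0.0 == -0.0, which is what breaks GROUP BY on map types.
# With normalization, both map_keys calls should report [0.0].
df.select(F.map_keys("m1"), F.map_keys("m2")).show()
{code}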



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47563) Map normalization upon creation

2024-03-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47563:
---

Assignee: Stevo Mitric

> Map normalization upon creation
> ---
>
> Key: SPARK-47563
> URL: https://issues.apache.org/jira/browse/SPARK-47563
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stevo Mitric
>Assignee: Stevo Mitric
>Priority: Major
>  Labels: pull-request-available
>
> Add handling of map normalization upon creation in ArrayBasedMapBuilder. 
> Currently a map with keys 0.0 and -0.0 will behave as if they are separate 
> values. This will cause issues when doing GROUP BY on map types.
> Refer to this conversation 
> [https://github.com/apache/spark/pull/45549#discussion_r1537803505]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47489) Action performed on Dataset is triggering an error as "The Spark SQL phase planning failed with an internal error."

2024-03-27 Thread Jagadeesh Marada (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831288#comment-17831288
 ] 

Jagadeesh Marada commented on SPARK-47489:
--

I have added the Spark driver log with TRACE log level to help better 
understand this issue.

> Action performed on Dataset is triggering an error as "The Spark SQL phase 
> planning failed with an internal error."
> ---
>
> Key: SPARK-47489
> URL: https://issues.apache.org/jira/browse/SPARK-47489
> Project: Spark
>  Issue Type: Bug
>  Components: k8s, Spark Core, Spark Submit
>Affects Versions: 3.4.2
>Reporter: Jagadeesh Marada
>Priority: Major
> Attachments: code snippet.jpeg, exception_stack_trace.txt, jars added 
> to spark_submit.jpeg, landing-table-update-1711534916824-driver_TRACE.txt, 
> spark-submit cmd.txt
>
>
> Hi team, 
> We are upgrading from Spark 3.3.2 to 3.4.2 and facing an issue in the Spark 
> driver as below.
> "org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase 
> planning failed with an internal error. You hit a bug in Spark or the Spark 
> plugins you use. Please, report this bug to the corresponding communities or 
> vendors, and provide the full stack trace."
>  
> The complete stack trace is attached.
> After analysing further, we found that when we try to perform any action on 
> the dataset, Spark is unable to plan its execution, which results in this 
> exception. Attaching the exception stack trace and a code snippet showing 
> exactly where we trigger an action to convert the dataset to a Java RDD.
>  
> Attaching the spark-submit command as well for your reference.
> Can you please check this issue and let us know if any fix is available?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47489) Action performed on Dataset is triggering an error as "The Spark SQL phase planning failed with an internal error."

2024-03-27 Thread Jagadeesh Marada (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagadeesh Marada updated SPARK-47489:
-
Attachment: (was: landing-table-update-1710866680486-driver.txt)

> Action performed on Dataset is triggering an error as "The Spark SQL phase 
> planning failed with an internal error."
> ---
>
> Key: SPARK-47489
> URL: https://issues.apache.org/jira/browse/SPARK-47489
> Project: Spark
>  Issue Type: Bug
>  Components: k8s, Spark Core, Spark Submit
>Affects Versions: 3.4.2
>Reporter: Jagadeesh Marada
>Priority: Major
> Attachments: code snippet.jpeg, exception_stack_trace.txt, jars added 
> to spark_submit.jpeg, landing-table-update-1711534916824-driver_TRACE.txt, 
> spark-submit cmd.txt
>
>
> Hi team, 
> We are upgrading from Spark 3.3.2 to 3.4.2 and facing an issue in the Spark 
> driver as below.
> "org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase 
> planning failed with an internal error. You hit a bug in Spark or the Spark 
> plugins you use. Please, report this bug to the corresponding communities or 
> vendors, and provide the full stack trace."
>  
> The complete stack trace is attached.
> After analysing further, we found that when we try to perform any action on 
> the dataset, Spark is unable to plan its execution, which results in this 
> exception. Attaching the exception stack trace and a code snippet showing 
> exactly where we trigger an action to convert the dataset to a Java RDD.
>  
> Attaching the spark-submit command as well for your reference.
> Can you please check this issue and let us know if any fix is available?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47489) Action performed on Dataset is triggering an error as "The Spark SQL phase planning failed with an internal error."

2024-03-27 Thread Jagadeesh Marada (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagadeesh Marada updated SPARK-47489:
-
Attachment: landing-table-update-1711534916824-driver_TRACE.txt

> Action performed on Dataset is triggering an error as "The Spark SQL phase 
> planning failed with an internal error."
> ---
>
> Key: SPARK-47489
> URL: https://issues.apache.org/jira/browse/SPARK-47489
> Project: Spark
>  Issue Type: Bug
>  Components: k8s, Spark Core, Spark Submit
>Affects Versions: 3.4.2
>Reporter: Jagadeesh Marada
>Priority: Major
> Attachments: code snippet.jpeg, exception_stack_trace.txt, jars added 
> to spark_submit.jpeg, landing-table-update-1710866680486-driver.txt, 
> landing-table-update-1711534916824-driver_TRACE.txt, spark-submit cmd.txt
>
>
> Hi team, 
> We are upgrading from Spark 3.3.2 to 3.4.2 and facing an issue in the Spark 
> driver as below.
> "org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase 
> planning failed with an internal error. You hit a bug in Spark or the Spark 
> plugins you use. Please, report this bug to the corresponding communities or 
> vendors, and provide the full stack trace."
>  
> The complete stack trace is attached.
> After analysing further, we found that when we try to perform any action on 
> the dataset, Spark is unable to plan its execution, which results in this 
> exception. Attaching the exception stack trace and a code snippet showing 
> exactly where we trigger an action to convert the dataset to a Java RDD.
>  
> Attaching the spark-submit command as well for your reference.
> Can you please check this issue and let us know if any fix is available?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47489) Action performed on Dataset is triggering an error as "The Spark SQL phase planning failed with an internal error."

2024-03-27 Thread Jagadeesh Marada (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagadeesh Marada updated SPARK-47489:
-
Attachment: landing-table-update-1710866680486-driver.txt

> Action performed on Dataset is triggering an error as "The Spark SQL phase 
> planning failed with an internal error."
> ---
>
> Key: SPARK-47489
> URL: https://issues.apache.org/jira/browse/SPARK-47489
> Project: Spark
>  Issue Type: Bug
>  Components: k8s, Spark Core, Spark Submit
>Affects Versions: 3.4.2
>Reporter: Jagadeesh Marada
>Priority: Major
> Attachments: code snippet.jpeg, exception_stack_trace.txt, jars added 
> to spark_submit.jpeg, landing-table-update-1710866680486-driver.txt, 
> spark-submit cmd.txt
>
>
> Hi team, 
> We are upgrading from Spark 3.3.2 to 3.4.2 and facing an issue in the Spark 
> driver as below.
> "org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase 
> planning failed with an internal error. You hit a bug in Spark or the Spark 
> plugins you use. Please, report this bug to the corresponding communities or 
> vendors, and provide the full stack trace."
>  
> The complete stack trace is attached.
> After analysing further, we found that when we try to perform any action on 
> the dataset, Spark is unable to plan its execution, which results in this 
> exception. Attaching the exception stack trace and a code snippet showing 
> exactly where we trigger an action to convert the dataset to a Java RDD.
>  
> Attaching the spark-submit command as well for your reference.
> Can you please check this issue and let us know if any fix is available?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47616) Document Mapping Spark SQL Data Types from MySQL

2024-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47616:
---
Labels: pull-request-available  (was: )

> Document Mapping Spark SQL Data Types from MySQL
> 
>
> Key: SPARK-47616
> URL: https://issues.apache.org/jira/browse/SPARK-47616
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47616) Document Mapping Spark SQL Data Types from MySQL

2024-03-27 Thread Kent Yao (Jira)
Kent Yao created SPARK-47616:


 Summary: Document Mapping Spark SQL Data Types from MySQL
 Key: SPARK-47616
 URL: https://issues.apache.org/jira/browse/SPARK-47616
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47615) Aggregate + First() Function - ArrayIndexOutOfBoundsException - ColumnPruning?

2024-03-27 Thread Frederik Schreiber (Jira)
Frederik Schreiber created SPARK-47615:
--

 Summary: Aggregate + First() Function - 
ArrayIndexOutOfBoundsException - ColumnPruning?
 Key: SPARK-47615
 URL: https://issues.apache.org/jira/browse/SPARK-47615
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.5.0, 3.4.1
 Environment: Amazon EMR version
emr-7.0.0
Installed applications
Tez 0.10.2, Spark 3.5.0
Amazon Linux release
2023.3.20240312.0
 
1 Master Node m6g.xlarge
2 Core Nodes m6g.2xlarge
 
 
Reporter: Frederik Schreiber


Currently I'm investigating upgrading our code base from Spark 3.3.0 to 3.5.0 
(embedded in a dedicated AWS EMR cluster).
 
I get the following exception when I execute my code on the cluster; if I run 
local unit tests, the code runs as expected without any exception.
 
 
{code:java}
24/03/26 19:32:19 INFO RecordServerQueryListener: Cleaning up temp directory - /user/KKQI7VHKTMNQZJQNMMZXKH5KYNRPOHXG/application_1711468652551_0023
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 186) (ip-10-1-1-6.eu-central-1.compute.internal executor 2): java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 3
  at org.apache.spark.sql.vectorized.ColumnarBatch.column(ColumnarBatch.java:95)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnaraggregatetorow_parquetMax_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnaraggregatetorow_nextBatch_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source)
  at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:142)
  at org.apache.spark.shuffle.ShuffleWriteProcessor.doWrite(ShuffleWriteProcessor.scala:45)
  at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:68)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
  at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
  at org.apache.spark.scheduler.Task.run(Task.scala:143)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629)
  at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
  at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  at java.base/java.lang.Thread.run(Thread.java:840)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3067)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3003)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3002)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3002)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1318)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1318)
  at scala.Option.foreach(Option.scala:407)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1318)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3271)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3205)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3194)
  at 

[jira] [Resolved] (SPARK-47611) Cleanup dead code in MySQLDialect.getCatalystType

2024-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47611.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45734
[https://github.com/apache/spark/pull/45734]

> Cleanup dead code in MySQLDialect.getCatalystType
> -
>
> Key: SPARK-47611
> URL: https://issues.apache.org/jira/browse/SPARK-47611
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47491) Re-enable `driver log links` test in YarnClusterSuite

2024-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47491.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45618
[https://github.com/apache/spark/pull/45618]

> Re-enable `driver log links` test in YarnClusterSuite
> -
>
> Key: SPARK-47491
> URL: https://issues.apache.org/jira/browse/SPARK-47491
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests, YARN
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47614) Rename `JavaModuleOptions` to `JVMRuntimeOptions`

2024-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47614:
---
Labels: pull-request-available  (was: )

> Rename `JavaModuleOptions` to `JVMRuntimeOptions`
> -
>
> Key: SPARK-47614
> URL: https://issues.apache.org/jira/browse/SPARK-47614
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47614) Rename `JavaModuleOptions` to `JVMRuntimeOptions`

2024-03-27 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-47614:
---

 Summary: Rename `JavaModuleOptions` to `JVMRuntimeOptions`
 Key: SPARK-47614
 URL: https://issues.apache.org/jira/browse/SPARK-47614
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39983) Should not cache unserialized broadcast relations on the driver

2024-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-39983:
---
Labels: pull-request-available  (was: )

> Should not cache unserialized broadcast relations on the driver
> ---
>
> Key: SPARK-39983
> URL: https://issues.apache.org/jira/browse/SPARK-39983
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Alex Balikov
>Assignee: Alex Balikov
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> In TorrentBroadcast.writeBlocks we store the unserialized broadcast object in 
> addition to the serialized version of it - 
> {code:java}
> private def writeBlocks(value: T): Int = {
>   import StorageLevel._
>   // Store a copy of the broadcast variable in the driver so that tasks run on the driver
>   // do not create a duplicate copy of the broadcast variable's value.
>   val blockManager = SparkEnv.get.blockManager
>   if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, tellMaster = false)) {
>     throw new SparkException(s"Failed to store $broadcastId in BlockManager")
>   }
>  {code}
> In case of broadcast relations, these objects can be fairly large (60MB in 
> one observed case) and are not strictly necessary on the driver.
> Add the option to not keep the unserialized versions of the objects.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47613) Issue with Spark Connect on Python 3.12

2024-03-27 Thread Kai-Michael Roesner (Jira)
Kai-Michael Roesner created SPARK-47613:
---

 Summary: Issue with Spark Connect on Python 3.12
 Key: SPARK-47613
 URL: https://issues.apache.org/jira/browse/SPARK-47613
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.5.0, 3.4.1
Reporter: Kai-Michael Roesner


When trying to create a remote Spark session with PySpark on Python 3.12, a 
{{ModuleNotFoundError: No module named 'distutils'}} exception is thrown. In 
Python 3.12 {{distutils}} was removed from the stdlib. As a workaround we can 
{{import setuptools}} before creating the session. See also [this question on 
SOF|https://stackoverflow.com/questions/78207291] and the 
[answer|https://stackoverflow.com/a/78212125/11474852] by Anderson Bravalheri.
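
A small sketch of the workaround described above; the Spark Connect connection string below is a hypothetical example, not taken from the report.

{code:python}
# Workaround sketch for Python 3.12: import setuptools before creating the
# remote session (placing it before the pyspark imports is the safest ordering),
# so that its vendored distutils shim is importable when pyspark needs it.
import setuptools  # noqa: F401

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://localhost:15002")  # hypothetical Spark Connect endpoint
    .getOrCreate()
)
print(spark.range(3).collect())
{code}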



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47612) Improve picking the side of partially clustered distribution according to partition size

2024-03-27 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated SPARK-47612:
---
Description: 
How we currently pick the side of partially clustered distribution:

SPJ currently relies on a simple heuristic and always picks the side with the 
smaller data size, based on table statistics, as the fully clustered side, even 
though that side could also contain skewed partitions.

We can potentially do a fine-grained comparison based on partition values, since 
we have that information now.

  was:
Now we pick up the side of partially clustered distribution:


Using plan statistics to determine which side of join to fully
cluster partition values.

We can optimize to use partition size since we have the information now.


> Improve picking the side of partially clustered distribution according to 
> partition size
> 
>
> Key: SPARK-47612
> URL: https://issues.apache.org/jira/browse/SPARK-47612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Qi Zhu
>Priority: Major
>
> How we currently pick the side of partially clustered distribution:
> SPJ currently relies on a simple heuristic and always picks the side with the 
> smaller data size, based on table statistics, as the fully clustered side, 
> even though that side could also contain skewed partitions.
> We can potentially do a fine-grained comparison based on partition values, 
> since we have that information now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org