[jira] [Updated] (SPARK-47612) Improve picking the side of partially clustered distribution according to partition size

2024-03-26 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated SPARK-47612:
---
Description: 
Currently we pick the side of the partially clustered distribution as follows:

SPJ relies on a simple heuristic and always picks the side with the smaller 
data size, based on table statistics, as the fully clustered side, even though 
that side could still contain skewed partitions.


We can potentially do a fine-grained comparison based on partition values, 
since that information is now available.
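
As a rough illustration, a fine-grained choice could compare the largest 
per-partition-value size on each side instead of a single table-level 
statistic. The helper below is a hypothetical sketch, not the actual SPJ code:

// Hypothetical helper: pick the side to fully cluster by comparing the
// largest partition for each join key value, so that a table that is small
// overall but contains one skewed partition is not blindly chosen.
def chooseFullyClusteredSide(
    leftSizes: Map[Seq[Any], Long],   // partition value -> size in bytes
    rightSizes: Map[Seq[Any], Long]): String = {
  val leftMax = if (leftSizes.isEmpty) 0L else leftSizes.values.max
  val rightMax = if (rightSizes.isEmpty) 0L else rightSizes.values.max
  if (leftMax <= rightMax) "left" else "right"
}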

  was:
Currently we pick the side of the partially clustered distribution as follows:


Using plan statistics to determine which side of the join to fully
cluster partition values.

We can optimize this to use partition sizes, since that information is now 
available.


> Improve picking the side of partially clustered distribution according to 
> partition size
> 
>
> Key: SPARK-47612
> URL: https://issues.apache.org/jira/browse/SPARK-47612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Qi Zhu
>Priority: Major
>
> Currently we pick the side of the partially clustered distribution as follows:
> SPJ relies on a simple heuristic and always picks the side with the smaller 
> data size, based on table statistics, as the fully clustered side, even 
> though that side could still contain skewed partitions.
> We can potentially do a fine-grained comparison based on partition values, 
> since that information is now available.






[jira] [Created] (SPARK-47612) Improve picking the side of partially clustered distribution according to partition size

2024-03-26 Thread Qi Zhu (Jira)
Qi Zhu created SPARK-47612:
--

 Summary: Improve picking the side of partially clustered 
distribution according to partition size
 Key: SPARK-47612
 URL: https://issues.apache.org/jira/browse/SPARK-47612
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Qi Zhu


Currently we pick the side of the partially clustered distribution as follows:


Using plan statistics to determine which side of the join to fully
cluster partition values.

We can optimize this to use partition sizes, since that information is now 
available.






[jira] [Updated] (SPARK-47611) Cleanup dead code in MySQLDialect.getCatalystType

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47611:
---
Labels: pull-request-available  (was: )

> Cleanup dead code in MySQLDialect.getCatalystType
> -
>
> Key: SPARK-47611
> URL: https://issues.apache.org/jira/browse/SPARK-47611
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47611) Cleanup dead code in MySQLDialect.getCatalystType

2024-03-26 Thread Kent Yao (Jira)
Kent Yao created SPARK-47611:


 Summary: Cleanup dead code in MySQLDialect.getCatalystType
 Key: SPARK-47611
 URL: https://issues.apache.org/jira/browse/SPARK-47611
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao









[jira] [Resolved] (SPARK-47610) Always set io.netty.tryReflectionSetAccessible=true

2024-03-26 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-47610.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45733
[https://github.com/apache/spark/pull/45733]

> Always set io.netty.tryReflectionSetAccessible=true
> ---
>
> Key: SPARK-47610
> URL: https://issues.apache.org/jira/browse/SPARK-47610
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47610) Always set io.netty.tryReflectionSetAccessible=true

2024-03-26 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-47610:


Assignee: Cheng Pan

> Always set io.netty.tryReflectionSetAccessible=true
> ---
>
> Key: SPARK-47610
> URL: https://issues.apache.org/jira/browse/SPARK-47610
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size

2024-03-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-42040.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

> SPJ: Introduce a new API for V2 input partition to report partition size
> 
>
> Key: SPARK-42040
> URL: https://issues.apache.org/jira/browse/SPARK-42040
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> It's useful for an {{InputPartition}} to also report its size (in bytes), so 
> that Spark can use this info to decide whether partition grouping should be 
> applied or not.
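
A minimal sketch of what such size reporting might look like; the mix-in name 
and shape are assumptions for illustration, not necessarily the merged API:

import org.apache.spark.sql.connector.read.InputPartition

// Assumed mix-in: lets a DataSource V2 input partition expose its size so
// Spark can weigh whether partition grouping pays off.
trait HasPartitionSize {
  def sizeInBytes: Long
}

// Example partition that carries its on-disk size along with its location.
case class FileBackedPartition(path: String, bytes: Long)
  extends InputPartition with HasPartitionSize {
  override def sizeInBytes: Long = bytes
}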






[jira] [Assigned] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size

2024-03-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-42040:


Assignee: Qi Zhu  (was: zhuqi)

> SPJ: Introduce a new API for V2 input partition to report partition size
> 
>
> Key: SPARK-42040
> URL: https://issues.apache.org/jira/browse/SPARK-42040
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
>
> It's useful for an {{InputPartition}} to also report its size (in bytes), so 
> that Spark can use this info to decide whether partition grouping should be 
> applied or not.






[jira] [Assigned] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size

2024-03-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-42040:


Assignee: zhuqi

> SPJ: Introduce a new API for V2 input partition to report partition size
> 
>
> Key: SPARK-42040
> URL: https://issues.apache.org/jira/browse/SPARK-42040
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: zhuqi
>Priority: Major
>  Labels: pull-request-available
>
> It's useful for an {{InputPartition}} to also report its size (in bytes), so 
> that Spark can use this info to decide whether partition grouping should be 
> applied or not.






[jira] [Resolved] (SPARK-47562) Factor literal handling out of `plan.py`

2024-03-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-47562.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45719
[https://github.com/apache/spark/pull/45719]

> Factor literal handling out of `plan.py`
> 
>
> Key: SPARK-47562
> URL: https://issues.apache.org/jira/browse/SPARK-47562
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47562) Factor literal handling out of `plan.py`

2024-03-26 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-47562:
-

Assignee: Ruifeng Zheng

> Factor literal handling out of `plan.py`
> 
>
> Key: SPARK-47562
> URL: https://issues.apache.org/jira/browse/SPARK-47562
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-47570) Integrate range scan encoder changes with timer implementation

2024-03-26 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-47570:


Assignee: Jing Zhan

> Integrate range scan encoder changes with timer implementation
> --
>
> Key: SPARK-47570
> URL: https://issues.apache.org/jira/browse/SPARK-47570
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Jing Zhan
>Assignee: Jing Zhan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47570) Integrate range scan encoder changes with timer implementation

2024-03-26 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47570.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45709
[https://github.com/apache/spark/pull/45709]

> Integrate range scan encoder changes with timer implementation
> --
>
> Key: SPARK-47570
> URL: https://issues.apache.org/jira/browse/SPARK-47570
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Jing Zhan
>Assignee: Jing Zhan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-47273) Implement python stream writer interface

2024-03-26 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47273.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45305
[https://github.com/apache/spark/pull/45305]

> Implement python stream writer interface
> 
>
> Key: SPARK-47273
> URL: https://issues.apache.org/jira/browse/SPARK-47273
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In order to support developing Spark streaming sinks in Python, we need to 
> implement a Python stream writer interface.
> Reuse PythonPartitionWriter to implement the serialization and execution of 
> the write callback in the executor.
> Implement a Python worker process to run the Python streaming data sink 
> committer and communicate with the JVM through a socket in the Spark driver. 
> For each Python streaming data sink instance, a long-lived Python worker 
> process will be created. Inside that process, the Python write committer will 
> receive abort or commit calls and send the result back through the socket.
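
For context, a simplified sketch of the driver-side handle described above; 
the class name and wire format are assumptions for illustration, not the 
actual implementation:

import java.io.{DataInputStream, DataOutputStream}
import java.net.Socket

// Illustrative only: drives a long-lived Python committer process over a
// socket, sending commit/abort requests and reading back a success flag.
class PythonSinkCommitRunner(socket: Socket) {
  private val out = new DataOutputStream(socket.getOutputStream)
  private val in = new DataInputStream(socket.getInputStream)

  def commit(batchId: Long): Boolean = send("commit", batchId)
  def abort(batchId: Long): Boolean = send("abort", batchId)

  private def send(op: String, batchId: Long): Boolean = {
    out.writeUTF(op)   // assumed wire format
    out.writeLong(batchId)
    out.flush()
    in.readBoolean()   // the Python committer replies with the result
  }
}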






[jira] [Assigned] (SPARK-47273) Implement python stream writer interface

2024-03-26 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-47273:


Assignee: Chaoqin Li

> Implement python stream writer interface
> 
>
> Key: SPARK-47273
> URL: https://issues.apache.org/jira/browse/SPARK-47273
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
>
> In order to support developing Spark streaming sinks in Python, we need to 
> implement a Python stream writer interface.
> Reuse PythonPartitionWriter to implement the serialization and execution of 
> the write callback in the executor.
> Implement a Python worker process to run the Python streaming data sink 
> committer and communicate with the JVM through a socket in the Spark driver. 
> For each Python streaming data sink instance, a long-lived Python worker 
> process will be created. Inside that process, the Python write committer will 
> receive abort or commit calls and send the result back through the socket.






[jira] [Assigned] (SPARK-47498) Refine some fractional GPU resource calculation tests.

2024-03-26 Thread Wu Yi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wu Yi reassigned SPARK-47498:
-

Assignee: Bobby Wang

> Refine some fractional GPU resource calculation tests.
> --
>
> Key: SPARK-47498
> URL: https://issues.apache.org/jira/browse/SPARK-47498
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Bobby Wang
>Assignee: Bobby Wang
>Priority: Trivial
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47498) Refine some fractional GPU resource calculation tests.

2024-03-26 Thread Wu Yi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wu Yi resolved SPARK-47498.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45631
[https://github.com/apache/spark/pull/45631]

> Refine some fractional GPU resource calculation tests.
> --
>
> Key: SPARK-47498
> URL: https://issues.apache.org/jira/browse/SPARK-47498
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Bobby Wang
>Assignee: Bobby Wang
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-47610) Always set io.netty.tryReflectionSetAccessible=true

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47610:
---
Labels: pull-request-available  (was: )

> Always set io.netty.tryReflectionSetAccessible=true
> ---
>
> Key: SPARK-47610
> URL: https://issues.apache.org/jira/browse/SPARK-47610
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47610) Always set io.netty.tryReflectionSetAccessible=true

2024-03-26 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-47610:
-

 Summary: Always set io.netty.tryReflectionSetAccessible=true
 Key: SPARK-47610
 URL: https://issues.apache.org/jira/browse/SPARK-47610
 Project: Spark
  Issue Type: Improvement
  Components: Build, Spark Core
Affects Versions: 4.0.0
Reporter: Cheng Pan









[jira] [Updated] (SPARK-47609) CacheManager Lookup can miss picking InMemoryRelation corresponding to subplan

2024-03-26 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-47609:
-
Description: 
This issue became apparent while bringing my PR 
[https://github.com/apache/spark/pull/43854]

in sync with the latest master.

That PR is meant to collapse projects early, in the analyzer phase itself, so 
that the tree size is kept to a minimum as projects keep getting added.

But as part of that work, the CacheManager lookup also needed to be modified.

One of the newly added tests in master failed. Analysis of the failure shows 
that the CacheManager does not pick the cached InMemoryRelation for a subplan.

This shows up in the following existing test:

org.apache.spark.sql.DatasetCacheSuite
{quote}test("SPARK-26708 Cache data and cached plan should stay consistent") {
val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)

df.cache()
// Add df1 to the CacheManager; the buffer is currently empty.
df1.cache()
{color:#4c9aff}// After calling collect(), df1's buffer has been loaded.{color}
df1.collect()
// Add df2 to the CacheManager; the buffer is currently empty.
df2.cache()

// Verify that df1 is a InMemoryRelation plan with dependency on another cached 
plan.
assertCacheDependency(df1)
val df1InnerPlan = df1.queryExecution.withCachedData
.asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
// Verify that df2 is a InMemoryRelation plan with dependency on another cached 
plan.
assertCacheDependency(df2)

df.unpersist(blocking = true)

{color:#00875a}// Verify that df1's cache has stayed the same, since df1's 
cache already has data{color}
// before df.unpersist().
val df1Limit = df1.limit(2)
val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}
assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

// Verify that df2's cache has been re-cached, with a new physical plan rid of 
dependency
// on df, since df2's cache had not been loaded before df.unpersist().
val df2Limit = df2.limit(2)
val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}{quote}
{quote}*{color:#de350b}// This assertion is not right{color}*
assert(df2LimitInnerPlan.isDefined &&
!df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{quote}
 

Since df1 exists in the cache as InMemoryRelation,

val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)

df2 is derivable from the cached df1.

So when val df2Limit = df2.limit(2) is created, it should utilize the cached 
df1.

  was:
This issue became apparent while bringing my PR 
[https://github.com/apache/spark/pull/43854]

in sync with the latest master.

That PR is meant to collapse projects early, in the analyzer phase itself, so 
that the tree size is kept to a minimum as projects keep getting added.

But as part of that work, the CacheManager lookup also needed to be modified.

One of the newly added tests in master failed. Analysis of the failure shows 
that the CacheManager does not pick the cached InMemoryRelation for a subplan.

This shows up in the following existing test:
{quote}test("SPARK-26708 Cache data and cached plan should stay consistent") {
val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)

df.cache()
// Add df1 to the CacheManager; the buffer is currently empty.
df1.cache()
{color:#4c9aff}// After calling collect(), df1's buffer has been loaded.{color}
df1.collect()
// Add df2 to the CacheManager; the buffer is currently empty.
df2.cache()

// Verify that df1 is a InMemoryRelation plan with dependency on another cached 
plan.
assertCacheDependency(df1)
val df1InnerPlan = df1.queryExecution.withCachedData
.asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
// Verify that df2 is a InMemoryRelation plan with dependency on another cached 
plan.
assertCacheDependency(df2)

df.unpersist(blocking = true)

{color:#00875a}// Verify that df1's cache has stayed the same, since df1's 
cache already has data{color}
// before df.unpersist().
val df1Limit = df1.limit(2)
val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}
assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

// Verify that df2's cache has been re-cached, with a new physical plan rid of 
dependency
// on df, since df2's cache had not been loaded before df.unpersist().
val df2Limit = df2.limit(2)
val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}{quote}
{quote}*{color:#de350b}// This assertion is not right{color}*
assert(df2LimitInnerPlan.isDefined &&
!df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}{quote}

[jira] [Created] (SPARK-47609) CacheManager Lookup can miss picking InMemoryRelation corresponding to subplan

2024-03-26 Thread Asif (Jira)
Asif created SPARK-47609:


 Summary: CacheManager Lookup can miss picking InMemoryRelation 
corresponding to subplan
 Key: SPARK-47609
 URL: https://issues.apache.org/jira/browse/SPARK-47609
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1
Reporter: Asif


This issue became apparent while bringing my PR 
[https://github.com/apache/spark/pull/43854]

in sync with the latest master.

That PR is meant to collapse projects early, in the analyzer phase itself, so 
that the tree size is kept to a minimum as projects keep getting added.

But as part of that work, the CacheManager lookup also needed to be modified.

One of the newly added tests in master failed. Analysis of the failure shows 
that the CacheManager does not pick the cached InMemoryRelation for a subplan.

This shows up in the following existing test:
{quote}test("SPARK-26708 Cache data and cached plan should stay consistent") {
val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)

df.cache()
// Add df1 to the CacheManager; the buffer is currently empty.
df1.cache()
{color:#4c9aff}// After calling collect(), df1's buffer has been loaded.{color}
df1.collect()
// Add df2 to the CacheManager; the buffer is currently empty.
df2.cache()

// Verify that df1 is a InMemoryRelation plan with dependency on another cached 
plan.
assertCacheDependency(df1)
val df1InnerPlan = df1.queryExecution.withCachedData
.asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
// Verify that df2 is a InMemoryRelation plan with dependency on another cached 
plan.
assertCacheDependency(df2)

df.unpersist(blocking = true)

{color:#00875a}// Verify that df1's cache has stayed the same, since df1's 
cache already has data{color}
// before df.unpersist().
val df1Limit = df1.limit(2)
val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}
assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

// Verify that df2's cache has been re-cached, with a new physical plan rid of 
dependency
// on df, since df2's cache had not been loaded before df.unpersist().
val df2Limit = df2.limit(2)
val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}{quote}
{quote}*{color:#de350b}// This assertion is not right{color}*
assert(df2LimitInnerPlan.isDefined &&
!df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}{quote}
 

Since df1 exists in the cache as InMemoryRelation,

val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)

df2 is derivable from the cached df1.

So when val df2Limit = df2.limit(2) is created, it should utilize the cached 
df1.






[jira] [Comment Edited] (SPARK-26708) Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan

2024-03-26 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831116#comment-17831116
 ] 

Asif edited comment on SPARK-26708 at 3/27/24 12:58 AM:


I believe the current caching logic is suboptimal, and accordingly the 
regression test for this bug asserts a suboptimal approach.

The test for this bug is
{quote}test("SPARK-26708 Cache data and cached plan should stay consistent") {
val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)

df.cache()
// Add df1 to the CacheManager; the buffer is currently empty.
df1.cache()
// After calling collect(), df1's buffer has been loaded.
df1.collect()
// Add df2 to the CacheManager; the buffer is currently empty.
df2.cache()

// Verify that df1 is a InMemoryRelation plan with dependency on another cached 
plan.
assertCacheDependency(df1)
val df1InnerPlan = df1.queryExecution.withCachedData
.asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
// Verify that df2 is a InMemoryRelation plan with dependency on another cached 
plan.
assertCacheDependency(df2)

df.unpersist(blocking = true)

// Verify that df1's cache has stayed the same, since df1's cache already has 
data
// before df.unpersist().
val df1Limit = df1.limit(2)
val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}
assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

// Verify that df2's cache has been re-cached, with a new physical plan rid of 
dependency
// on df, since df2's cache had not been loaded before df.unpersist().
val df2Limit = df2.limit(2)
val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}
assert(df2LimitInnerPlan.isDefined &&
!df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{quote}
 

Optimal caching should have resulted in df2LimitInnerPlan actually containing 
an InMemoryTableScanExec corresponding to df1.

The reason is that df1 was already materialized, so it rightly exists in the 
cache.

And df2 is derivable from the cached df1 (df1 just has an extra projection but 
can otherwise serve df2).


was (Author: ashahid7):
I believe the current caching logic is suboptimal, and accordingly the 
regression test for this bug asserts a suboptimal approach.

The test for this bug is
{quote}test("SPARK-26708 Cache data and cached plan should stay consistent") {
val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)

df.cache()
// Add df1 to the CacheManager; the buffer is currently empty.
df1.cache()
// After calling collect(), df1's buffer has been loaded.
df1.collect()
// Add df2 to the CacheManager; the buffer is currently empty.
df2.cache()

// Verify that df1 is a InMemoryRelation plan with dependency on another cached 
plan.
assertCacheDependency(df1)
val df1InnerPlan = df1.queryExecution.withCachedData
.asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
// Verify that df2 is a InMemoryRelation plan with dependency on another cached 
plan.
assertCacheDependency(df2)

df.unpersist(blocking = true)

// Verify that df1's cache has stayed the same, since df1's cache already has 
data
// before df.unpersist().
val df1Limit = df1.limit(2)
val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}
assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

// Verify that df2's cache has been re-cached, with a new physical plan rid of 
dependency
// on df, since df2's cache had not been loaded before df.unpersist().
val df2Limit = df2.limit(2)
val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}
assert(df2LimitInnerPlan.isDefined &&
!df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}{quote}
 

Optimal caching should have resulted in df2LimitInnerPlan actually containing 
an InMemoryTableScanExec corresponding to df1.

The reason is that df2 was already materialized, so it rightly exists in the 
cache.

And df2 is derivable from the cached df1 (df1 just has an extra projection but 
can otherwise serve df2).

> Incorrect result caused by inconsistency between a SQL cache's cached RDD and 
> its physical plan
> ---
>
> Key: SPARK-26708
> URL: https://issues.apache.org/jira/browse/SPARK-26708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Wei Xue
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.1, 3.0.0
>

[jira] [Comment Edited] (SPARK-26708) Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan

2024-03-26 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831117#comment-17831117
 ] 

Asif edited comment on SPARK-26708 at 3/27/24 12:54 AM:


Towards that, please take a look at the ticket

https://issues.apache.org/jira/browse/SPARK-45959

and the PR associated with it.

That PR primarily deals with aggressive collapse of projects at the end of 
analysis, but as part of the fix it also uses an enhanced cached-plan lookup 
and thus produces the behaviour described above.


was (Author: ashahid7):
Towards that please take a look at ticket & PR:

[https://issues.apache.org/jira/browse/SPARK-45959|https://issues.apache.org/jira/browse/SPARK-45959]

> Incorrect result caused by inconsistency between a SQL cache's cached RDD and 
> its physical plan
> ---
>
> Key: SPARK-26708
> URL: https://issues.apache.org/jira/browse/SPARK-26708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Wei Xue
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.1, 3.0.0
>
>
> When performing non-cascading cache invalidation, {{recache}} is called on 
> the other cache entries which are dependent on the cache being invalidated. 
> It leads to the physical plans of those cache entries being re-compiled. For 
> those cache entries, if the cache RDD has already been persisted, chances 
> are there will be inconsistency between the data and the new plan. It can 
> cause a correctness issue if the new plan's {{outputPartitioning}} or 
> {{outputOrdering}} is different from that of the actual data, and meanwhile 
> the cache is used by another query that asks for a specific 
> {{outputPartitioning}} or {{outputOrdering}} which happens to match the new 
> plan but not the actual data.






[jira] [Commented] (SPARK-26708) Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan

2024-03-26 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831116#comment-17831116
 ] 

Asif commented on SPARK-26708:
--

I believe the current caching logic is suboptimal, and accordingly the 
regression test for this bug asserts a suboptimal approach.

The test for this bug is
{quote}test("SPARK-26708 Cache data and cached plan should stay consistent") {
val df = spark.range(0, 5).toDF("a")
val df1 = df.withColumn("b", $"a" + 1)
val df2 = df.filter($"a" > 1)

df.cache()
// Add df1 to the CacheManager; the buffer is currently empty.
df1.cache()
// After calling collect(), df1's buffer has been loaded.
df1.collect()
// Add df2 to the CacheManager; the buffer is currently empty.
df2.cache()

// Verify that df1 is a InMemoryRelation plan with dependency on another cached 
plan.
assertCacheDependency(df1)
val df1InnerPlan = df1.queryExecution.withCachedData
.asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
// Verify that df2 is a InMemoryRelation plan with dependency on another cached 
plan.
assertCacheDependency(df2)

df.unpersist(blocking = true)

// Verify that df1's cache has stayed the same, since df1's cache already has 
data
// before df.unpersist().
val df1Limit = df1.limit(2)
val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}
assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

// Verify that df2's cache has been re-cached, with a new physical plan rid of 
dependency
// on df, since df2's cache had not been loaded before df.unpersist().
val df2Limit = df2.limit(2)
val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
case i: InMemoryRelation => i.cacheBuilder.cachedPlan
}
assert(df2LimitInnerPlan.isDefined &&
!df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}{quote}
 

Optimal caching should have resulted in df2LimitInnerPlan actually containing 
an InMemoryTableScanExec corresponding to df1.

The reason is that df2 was already materialized, so it rightly exists in the 
cache.

And df2 is derivable from the cached df1 (df1 just has an extra projection but 
can otherwise serve df2).

> Incorrect result caused by inconsistency between a SQL cache's cached RDD and 
> its physical plan
> ---
>
> Key: SPARK-26708
> URL: https://issues.apache.org/jira/browse/SPARK-26708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Wei Xue
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.1, 3.0.0
>
>
> When performing non-cascading cache invalidation, {{recache}} is called on 
> the other cache entries which are dependent on the cache being invalidated. 
> It leads to the physical plans of those cache entries being re-compiled. For 
> those cache entries, if the cache RDD has already been persisted, chances 
> are there will be inconsistency between the data and the new plan. It can 
> cause a correctness issue if the new plan's {{outputPartitioning}} or 
> {{outputOrdering}} is different from that of the actual data, and meanwhile 
> the cache is used by another query that asks for a specific 
> {{outputPartitioning}} or {{outputOrdering}} which happens to match the new 
> plan but not the actual data.
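
A condensed sketch of the pre-fix failure window described above, along the 
lines of the test quoted earlier (not a test from the repo):

val df = spark.range(0, 5).toDF("a")
val df2 = df.filter($"a" > 1)
df.cache()
df2.cache()
df2.collect()                  // df2's cached RDD is now materialized
df.unpersist(blocking = true)  // non-cascading: df2's plan is re-compiled
// If the re-compiled plan's outputPartitioning/outputOrdering no longer
// matches the already-persisted RDD, a query that relies on them can
// silently read mispartitioned or misordered data.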






[jira] [Updated] (SPARK-47561) fix analyzer rule order issues about Alias

2024-03-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-47561:

Fix Version/s: 3.5.2

> fix analyzer rule order issues about Alias
> --
>
> Key: SPARK-47561
> URL: https://issues.apache.org/jira/browse/SPARK-47561
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>







[jira] [Updated] (SPARK-47558) [Arbitrary State Support] State TTL support - ValueState

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47558:
---
Labels: pull-request-available  (was: )

> [Arbitrary State Support] State TTL support - ValueState
> 
>
> Key: SPARK-47558
> URL: https://issues.apache.org/jira/browse/SPARK-47558
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
>
> Add support for expiring state values based on TTL for ValueState in the 
> transformWithState operator.






[jira] [Created] (SPARK-47608) Improve user experience of loading logs as json data source

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47608:
--

 Summary: Improve user experience of loading logs as json data 
source
 Key: SPARK-47608
 URL: https://issues.apache.org/jira/browse/SPARK-47608
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang


E.g. create a constant table schema in object Logging so that users can query 
the JSON log files easily.
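
A rough sketch of the idea; the field names are assumptions, and the real 
schema would follow the structured logging format:

import org.apache.spark.sql.types._

object Logging {
  // Assumed field set, for illustration only.
  val LOG_SCHEMA: StructType = new StructType()
    .add("ts", StringType)
    .add("level", StringType)
    .add("msg", StringType)
    .add("context", MapType(StringType, StringType))
    .add("exception", StringType)
}

// Users could then load the JSON logs directly:
// spark.read.schema(Logging.LOG_SCHEMA).json("/path/to/spark/logs")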






[jira] [Created] (SPARK-47607) Add documentation for Structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47607:
--

 Summary: Add documentation for Structured logging framework
 Key: SPARK-47607
 URL: https://issues.apache.org/jira/browse/SPARK-47607
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47606) Create log4j templates for both structured logging and plain text logging

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47606:
--

 Summary: Create log4j templates for both structured logging and 
plain text logging
 Key: SPARK-47606
 URL: https://issues.apache.org/jira/browse/SPARK-47606
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47605) Enable structured logging in all the test log4j2.properties

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47605:
--

 Summary: Enable structured logging in all the test 
log4j2.properties
 Key: SPARK-47605
 URL: https://issues.apache.org/jira/browse/SPARK-47605
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47604) Resource managers: Migrate logInfo with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47604:
--

 Summary: Resource managers: Migrate logInfo with variables to 
structured logging framework
 Key: SPARK-47604
 URL: https://issues.apache.org/jira/browse/SPARK-47604
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47600) MLLib: Migrate logInfo with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47600:
--

 Summary: MLLib: Migrate logInfo with variables to structured 
logging framework
 Key: SPARK-47600
 URL: https://issues.apache.org/jira/browse/SPARK-47600
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47601) Graphx: Migrate logs with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47601:
--

 Summary: Graphx:  Migrate logs with variables to structured 
logging framework
 Key: SPARK-47601
 URL: https://issues.apache.org/jira/browse/SPARK-47601
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47598) MLLib: Migrate logError with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47598:
--

 Summary: MLLib: Migrate logError with variables to structured 
logging framework
 Key: SPARK-47598
 URL: https://issues.apache.org/jira/browse/SPARK-47598
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47599) MLLib: Migrate logWarn with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47599:
--

 Summary: MLLib: Migrate logWarn with variables to structured 
logging framework
 Key: SPARK-47599
 URL: https://issues.apache.org/jira/browse/SPARK-47599
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47594) Connector module: Migrate logInfo with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47594:
--

 Summary: Connector module: Migrate logInfo with variables to 
structured logging framework
 Key: SPARK-47594
 URL: https://issues.apache.org/jira/browse/SPARK-47594
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47592) Connector module: Migrate logError with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47592:
--

 Summary: Connector module: Migrate logError with variables to 
structured logging framework
 Key: SPARK-47592
 URL: https://issues.apache.org/jira/browse/SPARK-47592
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47593) Connector module: Migrate logWarn with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47593:
--

 Summary: Connector module: Migrate logWarn with variables to 
structured logging framework
 Key: SPARK-47593
 URL: https://issues.apache.org/jira/browse/SPARK-47593
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47590) Hive-thriftserver: Migrate logWarn with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47590:
--

 Summary: Hive-thriftserver: Migrate logWarn with variables to 
structured logging framework
 Key: SPARK-47590
 URL: https://issues.apache.org/jira/browse/SPARK-47590
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47591) Hive-thriftserver: Migrate logInfo with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47591:
--

 Summary: Hive-thriftserver: Migrate logInfo with variables to 
structured logging framework
 Key: SPARK-47591
 URL: https://issues.apache.org/jira/browse/SPARK-47591
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47588) Hive module: Migrate logInfo with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47588:
--

 Summary: Hive module: Migrate logInfo with variables to structured 
logging framework
 Key: SPARK-47588
 URL: https://issues.apache.org/jira/browse/SPARK-47588
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47583) SQL core: Migrate logError with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47583:
--

 Summary: SQL core: Migrate logError with variables to structured 
logging framework
 Key: SPARK-47583
 URL: https://issues.apache.org/jira/browse/SPARK-47583
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47589) Hive-thriftserver: Migrate logError with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47589:
--

 Summary: Hive-thriftserver: Migrate logError with variables to 
structured logging framework
 Key: SPARK-47589
 URL: https://issues.apache.org/jira/browse/SPARK-47589
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47586) Hive module: Migrate logError with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47586:
--

 Summary: Hive module: Migrate logError with variables to 
structured logging framework
 Key: SPARK-47586
 URL: https://issues.apache.org/jira/browse/SPARK-47586
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47587) Hive module: Migrate logWarn with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47587:
--

 Summary: Hive module: Migrate logWarn with variables to structured 
logging framework
 Key: SPARK-47587
 URL: https://issues.apache.org/jira/browse/SPARK-47587
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47585) SQL core: Migrate logInfo with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47585:
--

 Summary: SQL core: Migrate logInfo with variables to structured 
logging framework
 Key: SPARK-47585
 URL: https://issues.apache.org/jira/browse/SPARK-47585
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47584) SQL core: Migrate logWarn with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47584:
--

 Summary: SQL core: Migrate logWarn with variables to structured 
logging framework
 Key: SPARK-47584
 URL: https://issues.apache.org/jira/browse/SPARK-47584
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47582) SQL catalyst: Migrate logInfo with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47582:
--

 Summary: SQL catalyst: Migrate logInfo with variables to 
structured logging framework
 Key: SPARK-47582
 URL: https://issues.apache.org/jira/browse/SPARK-47582
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47581) SQL catalyst: Migrate logWarn with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47581:
--

 Summary: SQL catalyst: Migrate logWarn with variables to 
structured logging framework
 Key: SPARK-47581
 URL: https://issues.apache.org/jira/browse/SPARK-47581
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47579) Spark core: Migrate logInfo with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47579:
--

 Summary: Spark core: Migrate logInfo with variables to structured 
logging framework
 Key: SPARK-47579
 URL: https://issues.apache.org/jira/browse/SPARK-47579
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47580) SQL catalyst: Migrate logError with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47580:
--

 Summary: SQL catalyst: Migrate logError with variables to 
structured logging framework
 Key: SPARK-47580
 URL: https://issues.apache.org/jira/browse/SPARK-47580
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47577) Spark core: Migrate logError with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47577:
--

 Summary: Spark core: Migrate logError with variables to structured 
logging framework
 Key: SPARK-47577
 URL: https://issues.apache.org/jira/browse/SPARK-47577
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang









[jira] [Created] (SPARK-47578) Spark core: Migrate logWarn with variables to structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47578:
--

 Summary: Spark core: Migrate logWarn with variables to structured 
logging framework
 Key: SPARK-47578
 URL: https://issues.apache.org/jira/browse/SPARK-47578
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47575) Implement logWarn API in structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47575:
--

 Summary: Implement logWarn API in structured logging framework
 Key: SPARK-47575
 URL: https://issues.apache.org/jira/browse/SPARK-47575
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47576) Implement logInfo API in structured logging framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47576:
--

 Summary: Implement logInfo API in structured logging framework
 Key: SPARK-47576
 URL: https://issues.apache.org/jira/browse/SPARK-47576
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources

2024-03-26 Thread Sundeep K (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sundeep K updated SPARK-47556:
--
Description: 
h3. Issue:

We noticed that K8s executor pods sometimes go into a crash loop with 'Error: 
MountVolume.SetUp failed for volume "spark-conf-volume-exec"'. Upon 
investigation we found two Spark jobs launched with the same application ID; 
when one of them finished first, it deleted its own resources and the 
resources of the other job too.

-> The Spark application ID is created by this 
[code|https://github.com/apache/spark/blob/36126a5c1821b4418afd5788963a939ea7f64078/core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala#L38]:
"spark-application-" + System.currentTimeMillis
This means that if two applications launch in the same millisecond, they end 
up with the same app ID.

-> The 
[spark-app-selector|https://github.com/apache/spark/blob/93f98c0a61ddb66eb777c3940fbf29fc58e2d79b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L23]
 label is added to every resource created by the driver, and its value is the 
application ID. The Kubernetes scheduler backend deletes all resources with 
the same 
[label|https://github.com/apache/spark/blob/2a8bb5cdd3a5a2d63428b82df5e5066a805ce878/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L162C1-L172C6]
 upon termination.

This deletes the config map and executor pods of the job that is still 
running; its driver tries to relaunch the executor pods, but the config map 
is gone, so they crash loop.
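
For illustration, the label-based cleanup that makes this destructive looks 
roughly like the following sketch (using the fabric8 client, which Spark's 
k8s backend builds on; names and namespace here are hypothetical, not the 
exact Spark code):
{code:java}
import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientBuilder}

// Sketch: deleting every pod and config map carrying the app's
// spark-app-selector label. If two apps share an appId, both are wiped.
val client: KubernetesClient = new KubernetesClientBuilder().build()
val appId = "spark-application-1711468800000" // hypothetical colliding id
client.pods().inNamespace("default")
  .withLabel("spark-app-selector", appId).delete()
client.configMaps().inNamespace("default")
  .withLabel("spark-app-selector", appId).delete()
{code}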
h3. Context

We are using [Spark on 
Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html] 
and launch our Spark jobs using PySpark. We launch multiple Spark jobs within 
a given k8s namespace; each job can be launched from a different pod or from 
a different process within a pod, and every job gets a unique app name. Here 
is how a job is launched (omitting irrelevant details):
{code:java}
# spark_conf has settings required for spark on k8s
sp = SparkSession.builder \
    .config(conf=spark_conf) \
    .appName('testapp')
sp.master(f'k8s://{kubernetes_host}')
session = sp.getOrCreate()
with session:
    session.sql('SELECT 1'){code}
h3. Repro

Set the same app ID in the Spark config and run two different jobs, one that 
finishes fast and one that runs slow. The slower job goes into a crash loop:
{code:java}
"spark.app.id": ""{code}
h3. Workaround

Set a unique spark.app.id for every job that runs on k8s, e.g.:
{code:java}
"spark.app.id": f'{AppName}-{CurrTimeInMilliSecs}-{UUId}'[:63]{code}
h3. Fix

Add a unique hash at the end of the application ID: 
[https://github.com/apache/spark/pull/45712] 
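
The gist of the approach (a rough sketch only; see the PR above for the 
actual change):
{code:java}
// Sketch: append a random suffix so two apps started in the same
// millisecond no longer collide.
def applicationId(): String = {
  val suffix = java.util.UUID.randomUUID().toString.take(8)
  s"spark-application-${System.currentTimeMillis}-$suffix"
}
{code}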

 

  was:
h3. Issue:

We noticed that K8s executor pods sometimes go into a crash loop with 'Error: 
MountVolume.SetUp failed for volume "spark-conf-volume-exec"'. Upon 
investigation we found two Spark jobs launched with the same application ID; 
when one of them finished first, it deleted its own resources and the 
resources of the other job too.

-> The Spark application ID is created by this 
[code|https://affirm.slack.com/archives/C06Q2GWLWKH/p1711132115304449?thread_ts=1711123500.783909&cid=C06Q2GWLWKH]:
"spark-application-" + System.currentTimeMillis
This means that if two applications launch in the same millisecond, they end 
up with the same app ID.

-> The 
[spark-app-selector|https://github.com/apache/spark/blob/93f98c0a61ddb66eb777c3940fbf29fc58e2d79b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L23]
 label is added to every resource created by the driver, and its value is the 
application ID. The Kubernetes scheduler backend deletes all resources with 
the same 
[label|https://github.com/apache/spark/blob/2a8bb5cdd3a5a2d63428b82df5e5066a805ce878/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L162C1-L172C6]
 upon termination.

This deletes the config map and executor pods of the job that is still 
running; its driver tries to relaunch the executor pods, but the config map 
is gone, so they crash loop.
h3. Context

We are using [Spark on 
Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html] 
and launch our Spark jobs using PySpark. We launch multiple Spark jobs within 
a given k8s namespace; each job can be launched from a different pod or from 
a different process within a pod, and every job gets a unique app name. Here 
is how a job is launched (omitting irrelevant details):
{code:java}
# spark_conf has settings required for spark on k8s
sp = SparkSession.builder \
    .config(conf=spark_conf) \
    .appName('testapp')
sp.master(f'k8s://{kubernetes_host}')
session = sp.getOrCreate()
with session:
    session.sql('SELECT 1'){code}
h3. Repro

Set 

[jira] [Updated] (SPARK-47572) Enforce Window partitionSpec is orderable.

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47572:
---
Labels: pull-request-available  (was: )

> Enforce Window partitionSpec is orderable.
> --
>
> Key: SPARK-47572
> URL: https://issues.apache.org/jira/browse/SPARK-47572
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.1, 3.3.4
>Reporter: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47574) Introduce Structured Logging Framework

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47574:
---
Labels: pull-request-available  (was: )

> Introduce Structured Logging Framework
> --
>
> Key: SPARK-47574
> URL: https://issues.apache.org/jira/browse/SPARK-47574
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
>
> Introduce Structured Logging Framework as per 
> [https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing]
>  .
>  * The default logging output format will be json lines. For example 
> {code:java}
> {
>    "ts":"2023-03-12T12:02:46.661-0700",
>    "level":"ERROR",
>    "msg":"Cannot determine whether executor 289 is alive or not",
>    "context":{
>        "executor_id":"289"
>    },
>    "exception":{
>       "class":"org.apache.spark.SparkException",
>       "msg":"Exception thrown in awaitResult",
>       "stackTrace":"..."
>    },
>    "source":"BlockManagerMasterEndpoint"
> } {code}
>  * Introduce a new configuration `spark.log.structuredLogging.enabled` to 
> control the default log4j configuration. Users can set it to false to get 
> plain-text log output.
>  * The change will start with the logError method. Example changes on the 
> API: from
> `logError(s"Cannot determine whether executor $executorId is alive or not.", 
> e)` to `logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, 
> executorId)} is alive or not.", e)`
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47573) Support custom driver log URLs for Kubernetes

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47573:
---
Labels: pull-request-available  (was: )

> Support custom driver log URLs for Kubernetes
> -
>
> Key: SPARK-47573
> URL: https://issues.apache.org/jira/browse/SPARK-47573
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Enrico Minack
>Priority: Major
>  Labels: pull-request-available
>
> Spark provides the option to set the URL for *executor* logs via 
> {{spark.ui.custom.executor.log.url}}. This should be possible for *driver* 
> logs as well.
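
For reference, the existing executor-side setting looks roughly like this 
(illustrative values only; the pattern variables shown are the ones 
documented for the executor log URL, and the ticket asks for a driver-side 
analogue of the same shape):
{code:java}
import org.apache.spark.SparkConf

// Illustrative: executor log URLs built from documented pattern variables.
val conf = new SparkConf()
  .set("spark.ui.custom.executor.log.url",
    "https://logs.example.com/{{APP_ID}}/{{EXECUTOR_ID}}/{{FILE_NAME}}")
{code}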



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47574) Introduce Structured Logging Framework

2024-03-26 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-47574:
---
Description: 
Introduce Structured Logging Framework as per 
[https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing]
 .
 * The default logging output format will be json lines. For example 
{code:java}
{
   "ts":"2023-03-12T12:02:46.661-0700",
   "level":"ERROR",
   "msg":"Cannot determine whether executor 289 is alive or not",
   "context":{
       "executor_id":"289"
   },
   "exception":{
      "class":"org.apache.spark.SparkException",
      "msg":"Exception thrown in awaitResult",
      "stackTrace":"..."
   },
   "source":"BlockManagerMasterEndpoint"
} {code}

 * Introduce a new configuration `spark.log.structuredLogging.enabled` to 
control the default log4j configuration. Users can set it to false to get 
plain-text log output.
 * The change will start with the logError method. Example changes on the API: from
`logError(s"Cannot determine whether executor $executorId is alive or not.", 
e)` to `logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, 
executorId)} is alive or not.", e)`

 

  was:
Introduce Structured Logging Framework as per 
[https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing]
 .
 * The default logging output format will be json lines. For example 
{code:java}
{
   "ts":"2023-03-12T12:02:46.661-0700",
   "level":"ERROR",
   "msg":"Cannot determine whether executor 289 is alive or not",
   "context":{
       "executor_id":"289"
   },
   "exception":{
      "class":"org.apache.spark.SparkException",
      "msg":"Exception thrown in awaitResult",
      "stackTrace":"..."
   },
   "source":"BlockManagerMasterEndpoint"
} {code}

 * Introduce a new configuration `spark.log.structuredLogging.enabled` to 
control the default log4j configuration. Users can set it to false to get 
plain-text log output.
 * The change will start with the logError method. The Logging API will be 
changed from
`logError(s"Cannot determine whether executor $executorId is alive or not.", 
e)` to `logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, 
executorId)} is alive or not.", e)`

 


> Introduce Structured Logging Framework
> --
>
> Key: SPARK-47574
> URL: https://issues.apache.org/jira/browse/SPARK-47574
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Introduce Structured Logging Framework as per 
> [https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing]
>  .
>  * The default logging output format will be json lines. For example 
> {code:java}
> {
>    "ts":"2023-03-12T12:02:46.661-0700",
>    "level":"ERROR",
>    "msg":"Cannot determine whether executor 289 is alive or not",
>    "context":{
>        "executor_id":"289"
>    },
>    "exception":{
>       "class":"org.apache.spark.SparkException",
>       "msg":"Exception thrown in awaitResult",
>       "stackTrace":"..."
>    },
>    "source":"BlockManagerMasterEndpoint"
> } {code}
>  * Introduce a new configuration `spark.log.structuredLogging.enabled` to 
> control the default log4j configuration. Users can set it to false to get 
> plain-text log output.
>  * The change will start with the logError method. Example changes on the 
> API: from
> `logError(s"Cannot determine whether executor $executorId is alive or not.", 
> e)` to `logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, 
> executorId)} is alive or not.", e)`
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47574) Introduce Structured Logging Framework

2024-03-26 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47574:
--

 Summary: Introduce Structured Logging Framework
 Key: SPARK-47574
 URL: https://issues.apache.org/jira/browse/SPARK-47574
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Introduce Structured Logging Framework as per 
[https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing]
 .
 * The default logging output format will be json lines. For example 
{code:java}
{
   "ts":"2023-03-12T12:02:46.661-0700",
   "level":"ERROR",
   "msg":"Cannot determine whether executor 289 is alive or not",
   "context":{
       "executor_id":"289"
   },
   "exception":{
      "class":"org.apache.spark.SparkException",
      "msg":"Exception thrown in awaitResult",
      "stackTrace":"..."
   },
   "source":"BlockManagerMasterEndpoint"
} {code}

 * Introduce a new configuration `spark.log.structuredLogging.enabled` to 
control the default log4j configuration. Users can set it to false to get 
plain-text log output.
 * The change will start with the logError method. The Logging API will be 
changed from
`logError(s"Cannot determine whether executor $executorId is alive or not.", 
e)` to `logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, 
executorId)} is alive or not.", e)`
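
A minimal sketch of how such a log interpolator can capture MDC key-value 
pairs alongside the rendered message (illustrative only, with hypothetical 
names; not Spark's actual implementation):
{code:java}
object StructuredLogSketch {
  case class MDC(key: String, value: Any)

  implicit class LogStringContext(private val sc: StringContext) extends AnyVal {
    // Renders the message and collects the MDC pairs into a context map.
    def log(args: Any*): (String, Map[String, String]) = {
      val rendered = sc.s(args.map {
        case MDC(_, v) => v
        case other     => other
      }: _*)
      val ctx = args.collect { case MDC(k, v) => k -> String.valueOf(v) }.toMap
      (rendered, ctx)
    }
  }
}
{code}
With this in scope, log"... ${MDC("executor_id", 289)} ..." yields both the 
plain message and an "executor_id" -> "289" map that can be emitted as the 
"context" field of the JSON line above.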

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47572) Enforce Window partitionSpec is orderable.

2024-03-26 Thread Chenhao Li (Jira)
Chenhao Li created SPARK-47572:
--

 Summary: Enforce Window partitionSpec is orderable.
 Key: SPARK-47572
 URL: https://issues.apache.org/jira/browse/SPARK-47572
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.4, 3.5.1, 3.4.1
Reporter: Chenhao Li






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47571) date_format() java.lang.ArithmeticException: long overflow for large dates

2024-03-26 Thread Serge Rielau (Jira)
Serge Rielau created SPARK-47571:


 Summary: date_format() java.lang.ArithmeticException: long 
overflow for large dates
 Key: SPARK-47571
 URL: https://issues.apache.org/jira/browse/SPARK-47571
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Serge Rielau


The following works for CAST, but not for DATE_FORMAT():

select  cast(cast('5881580' AS DATE) AS STRING);
+5881580-01-01

spark-sql (default)> select date_format(cast('5881580' AS DATE), 
'yyy-mm-dd');

24/03/26 11:08:23 ERROR SparkSQLDriver: Failed in [select 
date_format(cast('5881580' AS DATE), 'yyy-mm-dd')]

java.lang.ArithmeticException: long overflow

 at java.base/java.lang.Math.multiplyExact(Math.java:1004)

 at 
org.apache.spark.sql.catalyst.util.SparkDateTimeUtils.instantToMicros(SparkDateTimeUtils.scala:122)

 at 
org.apache.spark.sql.catalyst.util.SparkDateTimeUtils.instantToMicros$(SparkDateTimeUtils.scala:116)

 at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.instantToMicros(DateTimeUtils.scala:41)

 at 
org.apache.spark.sql.catalyst.util.SparkDateTimeUtils.daysToMicros(SparkDateTimeUtils.scala:174)

 at 
org.apache.spark.sql.catalyst.util.SparkDateTimeUtils.daysToMicros$(SparkDateTimeUtils.scala:172)

 at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.daysToMicros(DateTimeUtils.scala:41)

 at 
org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToTimestamp$14(Cast.scala:642)

 at scala.runtime.java8.JFunction1$mcJI$sp.apply(JFunction1$mcJI$sp.scala:17)

 at org.apache.spark.sql.catalyst.expressions.Cast.buildCast(Cast.scala:557)

 at 
org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToTimestamp$13(Cast.scala:642)

 at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:1170)

 at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:558)
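
The overflow itself is easy to reproduce outside Spark (a sanity check, not 
Spark code: year +5881580 is roughly 1.9e14 epoch seconds, and converting to 
microseconds multiplies by 1,000,000, overshooting Long.MaxValue ~ 9.2e18):
{code:java}
import java.time.{LocalDate, ZoneOffset}

// Epoch seconds for +5881580-01-01: roughly 1.86e14.
val secs = LocalDate.of(5881580, 1, 1).atStartOfDay(ZoneOffset.UTC).toEpochSecond
// The same multiplication instantToMicros performs; throws
// java.lang.ArithmeticException: long overflow.
Math.multiplyExact(secs, 1000000L)
{code}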



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47570) Integrate range scan encoder changes with timer implementation

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47570:
---
Labels: pull-request-available  (was: )

> Integrate range scan encoder changes with timer implementation
> --
>
> Key: SPARK-47570
> URL: https://issues.apache.org/jira/browse/SPARK-47570
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Jing Zhan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47570) Integrate range scan encoder changes with timer implementation

2024-03-26 Thread Jing Zhan (Jira)
Jing Zhan created SPARK-47570:
-

 Summary: Integrate range scan encoder changes with timer 
implementation
 Key: SPARK-47570
 URL: https://issues.apache.org/jira/browse/SPARK-47570
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Jing Zhan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47569) Disallow comparing variant.

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47569:
---
Labels: pull-request-available  (was: )

> Disallow comparing variant.
> ---
>
> Key: SPARK-47569
> URL: https://issues.apache.org/jira/browse/SPARK-47569
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47569) Disallow comparing variant.

2024-03-26 Thread Chenhao Li (Jira)
Chenhao Li created SPARK-47569:
--

 Summary: Disallow comparing variant.
 Key: SPARK-47569
 URL: https://issues.apache.org/jira/browse/SPARK-47569
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Chenhao Li






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47555) Show a warning message about SQLException if `JDBCTableCatalog.loadTable` fails

2024-03-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47555:
--
Summary: Show a warning message about SQLException if 
`JDBCTableCatalog.loadTable` fails  (was: Record necessary raw exception log 
when loadTable)

> Show a warning message about SQLException if `JDBCTableCatalog.loadTable` 
> fails
> ---
>
> Key: SPARK-47555
> URL: https://issues.apache.org/jira/browse/SPARK-47555
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: xleoken
>Assignee: xleoken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47566) SubstringIndex

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47566:
---
Labels: pull-request-available  (was: )

> SubstringIndex
> --
>
> Key: SPARK-47566
> URL: https://issues.apache.org/jira/browse/SPARK-47566
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Milan Dankovic
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *SubstringIndex* built-in string function in 
> Spark. First confirm what is the expected behaviour for these functions when 
> given collated strings, and then move on to implementation and testing. One 
> way to go about this is to consider using {_}StringSearch{_}, an efficient 
> ICU service for string matching. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *SubstringIndex* functions 
> so that they support all collation types currently supported in Spark. To 
> understand what changes were introduced in order to enable full collation 
> support for other existing functions in Spark, take a look at the Spark PRs 
> and Jira tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
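
For a feel of the ICU service mentioned above, a minimal StringSearch sketch 
(illustrative only; the locale and strength are arbitrary choices here, not 
Spark's):
{code:java}
import com.ibm.icu.text.{Collator, RuleBasedCollator, StringSearch}
import com.ibm.icu.util.ULocale
import java.text.StringCharacterIterator

// Secondary strength ignores case differences, so "spark" matches "SPARK".
val collator = Collator.getInstance(ULocale.ROOT).asInstanceOf[RuleBasedCollator]
collator.setStrength(Collator.SECONDARY)
val search = new StringSearch(
  "spark", new StringCharacterIterator("Apache SPARK SQL"), collator)
val first = search.first() // 7: offset of "SPARK" in the target
{code}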



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47555) Record necessary raw exception log when loadTable

2024-03-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47555:
-

Assignee: xleoken

> Record necessary raw exception log when loadTable
> -
>
> Key: SPARK-47555
> URL: https://issues.apache.org/jira/browse/SPARK-47555
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: xleoken
>Assignee: xleoken
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47568) Fix race condition between maintenance thread and task thread for RocksDB snapshot

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47568:
---
Labels: pull-request-available  (was: )

> Fix race condition between maintenance thread and task thread for RocksDB 
> snapshot
> -
>
> Key: SPARK-47568
> URL: https://issues.apache.org/jira/browse/SPARK-47568
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>Reporter: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
>
> There are currently some race conditions between the maintenance thread and 
> the task thread which can result in a corrupted checkpoint state.
>  # The maintenance thread currently relies on class variable {{lastSnapshot}} 
> to find the latest checkpoint and uploads it to DFS. This checkpoint can be 
> modified at commit time by Task thread if a new snapshot is created.
>  # The task thread does not reset lastSnapshot at load time, which can result 
> in newer snapshots (if an old version is loaded) being considered valid and 
> uploaded to DFS. This results in VersionIdMismatch errors.
> This issue proposes to fix these issues by guarding latestSnapshot variable 
> modification, and setting latestSnapshot properly at load time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47555) Record necessary raw exception log when loadTable

2024-03-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47555:
--
Parent: SPARK-47361
Issue Type: Sub-task  (was: Improvement)

> Record necessary raw exception log when loadTable
> -
>
> Key: SPARK-47555
> URL: https://issues.apache.org/jira/browse/SPARK-47555
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: xleoken
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47555) Record necessary raw exception log when loadTable

2024-03-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47555.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45711
[https://github.com/apache/spark/pull/45711]

> Record necessary raw exception log when loadTable
> -
>
> Key: SPARK-47555
> URL: https://issues.apache.org/jira/browse/SPARK-47555
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: xleoken
>Assignee: xleoken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47568) Fix race condition between maintenance thread and task thread for RocksDB snapshot

2024-03-26 Thread Bhuwan Sahni (Jira)
Bhuwan Sahni created SPARK-47568:


 Summary: Fix race condition between maintenance thread and task 
thread for RocksDB snapshot
 Key: SPARK-47568
 URL: https://issues.apache.org/jira/browse/SPARK-47568
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.5.1, 3.5.0, 4.0.0, 3.5.2
Reporter: Bhuwan Sahni


There are currently some race conditions between the maintenance thread and 
the task thread which can result in a corrupted checkpoint state.
 # The maintenance thread currently relies on class variable {{lastSnapshot}} 
to find the latest checkpoint and uploads it to DFS. This checkpoint can be 
modified at commit time by Task thread if a new snapshot is created.
 # The task thread does not reset lastSnapshot at load time, which can result 
in newer snapshots (if an old version is loaded) being considered valid and 
uploaded to DFS. This results in VersionIdMismatch errors.

This issue proposes to fix these issues by guarding latestSnapshot variable 
modification, and setting latestSnapshot properly at load time.
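
A minimal sketch of the shape such a fix can take (names mirror the 
description; illustrative only, not the actual patch):
{code:java}
// Guard all access to the shared snapshot reference so the maintenance
// thread never uploads a snapshot the task thread is replacing, and
// reset it whenever an older version is loaded.
class SnapshotHolder[S] {
  private val lock = new Object
  private var latestSnapshot: Option[S] = None

  def set(s: Option[S]): Unit = lock.synchronized { latestSnapshot = s }

  // Task thread, on load(version): drop any stale "latest" snapshot.
  def resetOnLoad(): Unit = set(None)

  // Maintenance thread, before uploading to DFS: take-and-clear.
  def take(): Option[S] = lock.synchronized {
    val s = latestSnapshot
    latestSnapshot = None
    s
  }
}
{code}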



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47555) Record necessary raw exception log when loadTable

2024-03-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47555:
--
Affects Version/s: 4.0.0
   (was: 3.5.1)

> Record necessary raw exception log when loadTable
> -
>
> Key: SPARK-47555
> URL: https://issues.apache.org/jira/browse/SPARK-47555
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: xleoken
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47558) [Arbitrary State Support] State TTL support - ValueState

2024-03-26 Thread Bhuwan Sahni (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830977#comment-17830977
 ] 

Bhuwan Sahni commented on SPARK-47558:
--

https://github.com/apache/spark/pull/45674

> [Arbitrary State Support] State TTL support - ValueState
> 
>
> Key: SPARK-47558
> URL: https://issues.apache.org/jira/browse/SPARK-47558
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Bhuwan Sahni
>Priority: Major
>
> Add support for expiring state values based on TTL for ValueState in the 
> transformWithState operator.
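
A rough sketch of the expiry check such support implies (generic 
illustration; the names here are hypothetical, not the transformWithState 
API):
{code:java}
// Store the value with its expiration timestamp; filter expired
// entries on read and let cleanup remove them lazily.
case class TTLValue[T](value: T, expiresAtMs: Long)

def getIfAlive[T](entry: TTLValue[T], nowMs: Long): Option[T] =
  if (nowMs < entry.expiresAtMs) Some(entry.value) else None
{code}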



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47561) fix analyzer rule order issues about Alias

2024-03-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47561:
-

Assignee: Wenchen Fan

> fix analyzer rule order issues about Alias
> --
>
> Key: SPARK-47561
> URL: https://issues.apache.org/jira/browse/SPARK-47561
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47561) fix analyzer rule order issues about Alias

2024-03-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47561.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45718
[https://github.com/apache/spark/pull/45718]

> fix analyzer rule order issues about Alias
> --
>
> Key: SPARK-47561
> URL: https://issues.apache.org/jira/browse/SPARK-47561
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47544) [Pyspark] SparkSession builder method is incompatible with vs code intellisense

2024-03-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47544.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45700
[https://github.com/apache/spark/pull/45700]

> [Pyspark] SparkSession builder method is incompatible with vs code 
> intellisense
> ---
>
> Key: SPARK-47544
> URL: https://issues.apache.org/jira/browse/SPARK-47544
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Niranjan Jayakar
>Assignee: Niranjan Jayakar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: old.mov
>
>
> VS Code's IntelliSense is unable to recognize the methods under 
> `SparkSession.builder`.
>  
> See attachment.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47544) [Pyspark] SparkSession builder method is incompatible with vs code intellisense

2024-03-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47544:
-

Assignee: Niranjan Jayakar

> [Pyspark] SparkSession builder method is incompatible with vs code 
> intellisense
> ---
>
> Key: SPARK-47544
> URL: https://issues.apache.org/jira/browse/SPARK-47544
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Niranjan Jayakar
>Assignee: Niranjan Jayakar
>Priority: Major
>  Labels: pull-request-available
> Attachments: old.mov
>
>
> VS Code's IntelliSense is unable to recognize the methods under 
> `SparkSession.builder`.
>  
> See attachment.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47557) Audit MySQL ENUM/SET Types

2024-03-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47557:
-

Assignee: Kent Yao

> Audit MySQL ENUM/SET Types
> --
>
> Key: SPARK-47557
> URL: https://issues.apache.org/jira/browse/SPARK-47557
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47557) Audit MySQL ENUM/SET Types

2024-03-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47557.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45713
[https://github.com/apache/spark/pull/45713]

> Audit MySQL ENUM/SET Types
> --
>
> Key: SPARK-47557
> URL: https://issues.apache.org/jira/browse/SPARK-47557
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47413) Substring, Right, Left (all collations)

2024-03-26 Thread Gideon P (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830952#comment-17830952
 ] 

Gideon P commented on SPARK-47413:
--

[~davidm-db] Awesome, thanks! 

Do you have any guidance BTW as to when I should try to get this completed by? 

> Substring, Right, Left (all collations)
> ---
>
> Key: SPARK-47413
> URL: https://issues.apache.org/jira/browse/SPARK-47413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *Substring* built-in string function in 
> Spark (including *Right* and *Left* functions). First confirm what is the 
> expected behaviour for these functions when given collated strings, then move 
> on to the implementation that would enable handling strings of all collation 
> types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the {*}Substring{*}, 
> {*}Right{*}, and *Left* functions so that they support all collation types 
> currently supported in Spark. To understand what changes were introduced in 
> order to enable full collation support for other existing functions in Spark, 
> take a look at the Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47477) SubstringIndex, StringLocate (all collations)

2024-03-26 Thread Milan Dankovic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829408#comment-17829408
 ] 

Milan Dankovic edited comment on SPARK-47477 at 3/26/24 1:54 PM:
-

I am working on SubstringIndex sub-task


was (Author: JIRAUSER304529):
I am working on this

> SubstringIndex, StringLocate (all collations)
> -
>
> Key: SPARK-47477
> URL: https://issues.apache.org/jira/browse/SPARK-47477
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *SubstringIndex* and *StringLocate* built-in 
> string functions in Spark. First confirm what is the expected behaviour for 
> these functions when given collated strings, and then move on to 
> implementation and testing. One way to go about this is to consider using 
> {_}StringSearch{_}, an efficient ICU service for string matching. Implement 
> the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests 
> (CollationSuite) to reflect how this function should be used with collation 
> in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementation of similar functions 
> within other open-source DBMS, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *SubstringIndex* and 
> *StringLocate* functions so that they support all collation types currently 
> supported in Spark. To understand what changes were introduced in order to 
> enable full collation support for other existing functions in Spark, take a 
> look at the Spark PRs and Jira tickets for completed tasks in this parent 
> (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47567) StringLocate

2024-03-26 Thread Milan Dankovic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milan Dankovic updated SPARK-47567:
---
Description: 
Enable collation support for the *StringLocate* built-in string function in 
Spark. First confirm what is the expected behaviour for these functions when 
given collated strings, and then move on to implementation and testing. One way 
to go about this is to consider using {_}StringSearch{_}, an efficient ICU 
service for string matching. Implement the corresponding unit tests 
(CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect how 
this function should be used with collation in SparkSQL, and feel free to use 
your chosen Spark SQL Editor to experiment with the existing functions to learn 
more about how they work. In addition, look into the possible use-cases and 
implementation of similar functions within other open-source DBMS, such 
as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *StringLocate* function so 
that it supports all collation types currently supported in Spark. To 
understand what changes were introduced in order to enable full collation 
support for other existing functions in Spark, take a look at the Spark PRs and 
Jira tickets for completed tasks in this parent (for example: Contains, 
StartsWith, EndsWith).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class, as well as _StringSearch_ using the [ICU 
user 
guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] 
and [ICU 
docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
 Also, refer to the Unicode Technical Standard for string 
[searching|https://www.unicode.org/reports/tr10/#Searching] and 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].

> StringLocate
> 
>
> Key: SPARK-47567
> URL: https://issues.apache.org/jira/browse/SPARK-47567
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Milan Dankovic
>Priority: Major
>
> Enable collation support for the *StringLocate* built-in string function in 
> Spark. First confirm what is the expected behaviour for these functions when 
> given collated strings, and then move on to implementation and testing. One 
> way to go about this is to consider using {_}StringSearch{_}, an efficient 
> ICU service for string matching. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringLocate* functions so 
> that they support all collation types currently supported in Spark. To 
> understand what changes were introduced in order to enable full collation 
> support for other existing functions in Spark, take a look at the Spark PRs 
> and Jira tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47566) SubstringIndex

2024-03-26 Thread Milan Dankovic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830938#comment-17830938
 ] 

Milan Dankovic commented on SPARK-47566:


I am working on this

> SubstringIndex
> --
>
> Key: SPARK-47566
> URL: https://issues.apache.org/jira/browse/SPARK-47566
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Milan Dankovic
>Priority: Major
>
> Enable collation support for the *SubstringIndex* built-in string function in 
> Spark. First confirm what is the expected behaviour for these functions when 
> given collated strings, and then move on to implementation and testing. One 
> way to go about this is to consider using {_}StringSearch{_}, an efficient 
> ICU service for string matching. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *SubstringIndex* functions 
> so that they support all collation types currently supported in Spark. To 
> understand what changes were introduced in order to enable full collation 
> support for other existing functions in Spark, take a look at the Spark PRs 
> and Jira tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47567) StringLocate

2024-03-26 Thread Milan Dankovic (Jira)
Milan Dankovic created SPARK-47567:
--

 Summary: StringLocate
 Key: SPARK-47567
 URL: https://issues.apache.org/jira/browse/SPARK-47567
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Milan Dankovic






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47477) SubstringIndex, StringLocate (all collations)

2024-03-26 Thread Milan Dankovic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milan Dankovic updated SPARK-47477:
---
Description: 
Enable collation support for the *SubstringIndex* and *StringLocate* built-in 
string functions in Spark. First confirm what is the expected behaviour for 
these functions when given collated strings, and then move on to implementation 
and testing. One way to go about this is to consider using {_}StringSearch{_}, 
an efficient ICU service for string matching. Implement the corresponding unit 
tests (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to 
reflect how this function should be used with collation in SparkSQL, and feel 
free to use your chosen Spark SQL Editor to experiment with the existing 
functions to learn more about how they work. In addition, look into the 
possible use-cases and implementation of similar functions within other 
open-source DBMS, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *SubstringIndex* and 
*StringLocate* functions so that they support all collation types currently 
supported in Spark. To understand what changes were introduced in order to 
enable full collation support for other existing functions in Spark, take a 
look at the Spark PRs and Jira tickets for completed tasks in this parent (for 
example: Contains, StartsWith, EndsWith).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class, as well as _StringSearch_ using the [ICU 
user 
guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] 
and [ICU 
docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
 Also, refer to the Unicode Technical Standard for string 
[searching|https://www.unicode.org/reports/tr10/#Searching] and 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].

  was:
Enable collation support for the *StringInstr* and *FindInSet* built-in string 
functions in Spark. First confirm what is the expected behaviour for these 
functions when given collated strings, and then move on to implementation and 
testing. One way to go about this is to consider using {_}StringSearch{_}, an 
efficient ICU service for string matching. Implement the corresponding unit 
tests (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to 
reflect how this function should be used with collation in SparkSQL, and feel 
free to use your chosen Spark SQL Editor to experiment with the existing 
functions to learn more about how they work. In addition, look into the 
possible use-cases and implementation of similar functions within other 
open-source DBMS, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *StringInstr* and *FindInSet* 
functions so that they support all collation types currently supported in 
Spark. To understand what changes were introduced in order to enable full 
collation support for other existing functions in Spark, take a look at the 
Spark PRs and Jira tickets for completed tasks in this parent (for example: 
Contains, StartsWith, EndsWith).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class, as well as _StringSearch_ using the [ICU 
user 
guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] 
and [ICU 
docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
 Also, refer to the Unicode Technical Standard for string 
[searching|https://www.unicode.org/reports/tr10/#Searching] and 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].


> SubstringIndex, StringLocate (all collations)
> -
>
> Key: SPARK-47477
> URL: https://issues.apache.org/jira/browse/SPARK-47477
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *SubstringIndex* and *StringLocate* built-in 
> string functions in Spark. First confirm what is the expected behaviour for 
> these functions when given collated strings, and then move on to 
> implementation and testing. One way to go about this is to consider using 
> {_}StringSearch{_}, an efficient ICU service for string matching. Implement 
> the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests 
> (CollationSuite) to reflect how this function should be used with collation 
> in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment 
> with the existing functions to learn more about how they work. In addit

[jira] [Updated] (SPARK-28419) A patch for SparkThriftServer support multi-tenant authentication

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-28419:
---
Labels: pull-request-available  (was: )

> A patch for SparkThriftServer support multi-tenant authentication
> -
>
> Key: SPARK-28419
> URL: https://issues.apache.org/jira/browse/SPARK-28419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47566) SubstringIndex

2024-03-26 Thread Milan Dankovic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milan Dankovic updated SPARK-47566:
---
Description: 
Enable collation support for the *SubstringIndex* built-in string function in 
Spark. First confirm the expected behaviour of this function when given 
collated strings, then move on to implementation and testing. One way to 
approach this is to use {_}StringSearch{_}, an efficient ICU service for 
string matching. Implement the corresponding unit tests 
(CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect how 
this function should be used with collation in SparkSQL, and feel free to use 
your chosen Spark SQL Editor to experiment with the existing functions to learn 
more about how they work. In addition, look into the possible use-cases and 
implementations of similar functions in other open-source DBMSs, such as 
[PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *SubstringIndex* function so 
that it supports all collation types currently supported in Spark. To 
understand what changes were introduced to enable full collation support for 
other existing functions in Spark, take a look at the Spark PRs and Jira 
tickets for completed tasks in this parent (for example: Contains, StartsWith, 
EndsWith).
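
For example, a hypothetical expectation under a case-insensitive collation (the 
collation name and exact semantics are assumptions to be confirmed against the 
parent ticket):

{code:scala}
// In spark-shell; illustrative only.
spark.sql("SELECT substring_index('1x2X3' COLLATE UTF8_LCASE, 'x', 2)").show()
// expected: '1x2' if the delimiter 'x' also matches 'X' case-insensitively
{code}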

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class, as well as _StringSearch_ using the [ICU 
user 
guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] 
and [ICU 
docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
 Also, refer to the Unicode Technical Standard for string 
[searching|https://www.unicode.org/reports/tr10/#Searching] and 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].

> SubstringIndex
> --
>
> Key: SPARK-47566
> URL: https://issues.apache.org/jira/browse/SPARK-47566
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Milan Dankovic
>Priority: Major
>
> Enable collation support for the *SubstringIndex* built-in string function in 
> Spark. First confirm the expected behaviour of this function when given 
> collated strings, then move on to implementation and testing. One way to 
> approach this is to use {_}StringSearch{_}, an efficient ICU service for 
> string matching. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementations of similar functions in other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *SubstringIndex* function 
> so that it supports all collation types currently supported in Spark. To 
> understand what changes were introduced to enable full collation support for 
> other existing functions in Spark, take a look at the Spark PRs and Jira 
> tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47566) SubstringIndex

2024-03-26 Thread Milan Dankovic (Jira)
Milan Dankovic created SPARK-47566:
--

 Summary: SubstringIndex
 Key: SPARK-47566
 URL: https://issues.apache.org/jira/browse/SPARK-47566
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Milan Dankovic






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47477) SubstringIndex, StringLocate (all collations)

2024-03-26 Thread Mihailo Milosevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihailo Milosevic updated SPARK-47477:
--
Parent: (was: SPARK-46837)
Issue Type: New Feature  (was: Sub-task)

> SubstringIndex, StringLocate (all collations)
> -
>
> Key: SPARK-47477
> URL: https://issues.apache.org/jira/browse/SPARK-47477
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringInstr* and *FindInSet* built-in 
> string functions in Spark. First confirm the expected behaviour of these 
> functions when given collated strings, then move on to implementation and 
> testing. One way to approach this is to use {_}StringSearch{_}, an efficient 
> ICU service for string matching. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how these functions should be used with collation in SparkSQL, and feel free 
> to use your chosen Spark SQL Editor to experiment with the existing functions 
> to learn more about how they work. In addition, look into the possible 
> use-cases and implementations of similar functions in other open-source 
> DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringInstr* and 
> *FindInSet* functions so that they support all collation types currently 
> supported in Spark. To understand what changes were introduced in order to 
> enable full collation support for other existing functions in Spark, take a 
> look at the Spark PRs and Jira tickets for completed tasks in this parent 
> (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47477) SubstringIndex, StringLocate (all collations)

2024-03-26 Thread Mihailo Milosevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihailo Milosevic updated SPARK-47477:
--
Epic Link: SPARK-46830

> SubstringIndex, StringLocate (all collations)
> -
>
> Key: SPARK-47477
> URL: https://issues.apache.org/jira/browse/SPARK-47477
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringInstr* and *FindInSet* built-in 
> string functions in Spark. First confirm the expected behaviour of these 
> functions when given collated strings, then move on to implementation and 
> testing. One way to approach this is to use {_}StringSearch{_}, an efficient 
> ICU service for string matching. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how these functions should be used with collation in SparkSQL, and feel free 
> to use your chosen Spark SQL Editor to experiment with the existing functions 
> to learn more about how they work. In addition, look into the possible 
> use-cases and implementations of similar functions in other open-source 
> DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringInstr* and 
> *FindInSet* functions so that they support all collation types currently 
> supported in Spark. To understand what changes were introduced in order to 
> enable full collation support for other existing functions in Spark, take a 
> look at the Spark PRs and Jira tickets for completed tasks in this parent 
> (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47477) SubstringIndex, StringLocate (all collations)

2024-03-26 Thread Mihailo Milosevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihailo Milosevic updated SPARK-47477:
--
Labels:   (was: pull-request-available)

> SubstringIndex, StringLocate (all collations)
> -
>
> Key: SPARK-47477
> URL: https://issues.apache.org/jira/browse/SPARK-47477
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringInstr* and *FindInSet* built-in 
> string functions in Spark. First confirm the expected behaviour of these 
> functions when given collated strings, then move on to implementation and 
> testing. One way to approach this is to use {_}StringSearch{_}, an efficient 
> ICU service for string matching. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how these functions should be used with collation in SparkSQL, and feel free 
> to use your chosen Spark SQL Editor to experiment with the existing functions 
> to learn more about how they work. In addition, look into the possible 
> use-cases and implementations of similar functions in other open-source 
> DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringInstr* and 
> *FindInSet* functions so that they support all collation types currently 
> supported in Spark. To understand what changes were introduced in order to 
> enable full collation support for other existing functions in Spark, take a 
> look at the Spark PRs and Jira tickets for completed tasks in this parent 
> (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47477) SubstringIndex, StringLocate (all collations)

2024-03-26 Thread Milan Dankovic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milan Dankovic updated SPARK-47477:
---
Labels:   (was: pull-request-available)

> SubstringIndex, StringLocate (all collations)
> -
>
> Key: SPARK-47477
> URL: https://issues.apache.org/jira/browse/SPARK-47477
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringInstr* and *FindInSet* built-in 
> string functions in Spark. First confirm the expected behaviour of these 
> functions when given collated strings, then move on to implementation and 
> testing. One way to approach this is to use {_}StringSearch{_}, an efficient 
> ICU service for string matching. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how these functions should be used with collation in SparkSQL, and feel free 
> to use your chosen Spark SQL Editor to experiment with the existing functions 
> to learn more about how they work. In addition, look into the possible 
> use-cases and implementations of similar functions in other open-source 
> DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringInstr* and 
> *FindInSet* functions so that they support all collation types currently 
> supported in Spark. To understand what changes were introduced in order to 
> enable full collation support for other existing functions in Spark, take a 
> look at the Spark PRs and Jira tickets for completed tasks in this parent 
> (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47431) Add session level default Collation

2024-03-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47431.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45592
[https://github.com/apache/spark/pull/45592]

> Add session level default Collation
> ---
>
> Key: SPARK-47431
> URL: https://issues.apache.org/jira/browse/SPARK-47431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> For now, the session-level default collation is fixed to UTF8_BINARY. In the 
> future, we want to make it settable through an explicit session-level 
> configuration.
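>
> For illustration, the new setting might be exercised like this (a sketch; the 
> configuration key is assumed from the linked PR and should be verified 
> against the merged change):
> {code:scala}
> // In spark-shell; the key below is an assumption, verify against PR 45592.
> spark.conf.set("spark.sql.session.collation.default", "UNICODE_CI")
>
> // With the session default set, plain string literals should follow that
> // collation, e.g. a case-insensitive comparison:
> spark.sql("SELECT 'Spark' = 'SPARK'").show()   // expected: true under UNICODE_CI
> {code}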



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47565) PySpark workers dying in daemon mode idle queue fail query

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47565:
---
Labels: pull-request-available  (was: )

> PySpark workers dying in daemon mode idle queue fail query
> --
>
> Key: SPARK-47565
> URL: https://issues.apache.org/jira/browse/SPARK-47565
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.2, 3.5.1, 3.3.4
>Reporter: Sebastian Hillig
>Priority: Major
>  Labels: pull-request-available
>
> PySpark workers may die after entering the idle queue in 
> `PythonWorkerFactory`, either because of code that runs in the process or 
> because of external factors.
> When such a worker is drawn from the warm pool, the first read or write fails 
> with an I/O exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47565) PySpark workers dying in daemon mode idle queue fail query

2024-03-26 Thread Nikita Awasthi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830913#comment-17830913
 ] 

Nikita Awasthi commented on SPARK-47565:


User 'sebastianhillig-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/45635

> PySpark workers dying in daemon mode idle queue fail query
> --
>
> Key: SPARK-47565
> URL: https://issues.apache.org/jira/browse/SPARK-47565
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.2, 3.5.1, 3.3.4
>Reporter: Sebastian Hillig
>Priority: Major
>  Labels: pull-request-available
>
> PySpark workers may die after entering the idle queue in 
> `PythonWorkerFactory`, either because of code that runs in the process or 
> because of external factors.
> When such a worker is drawn from the warm pool, the first read or write fails 
> with an I/O exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47565) PySpark workers dying in daemon mode idle queue fail query

2024-03-26 Thread Sebastian Hillig (Jira)
Sebastian Hillig created SPARK-47565:


 Summary: PySpark workers dying in daemon mode idle queue fail query
 Key: SPARK-47565
 URL: https://issues.apache.org/jira/browse/SPARK-47565
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.4, 3.5.1, 3.4.2
Reporter: Sebastian Hillig


PySpark workers may die after entering the idle queue in `PythonWorkerFactory`, 
either because of code that runs in the process or because of external factors.

When such a worker is drawn from the warm pool, the first read or write fails 
with an I/O exception.
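
One possible direction (purely an illustrative sketch, not the actual 
`PythonWorkerFactory` change; the pool shape and the liveness probe are 
assumptions) is to validate idle workers before reuse and fall back to forking 
a fresh one:

{code:scala}
import java.net.Socket
import scala.collection.mutable

// Illustrative sketch: probe an idle worker's socket before handing it out,
// so a dead worker is discarded instead of failing the query on first use.
class IdleWorkerPool(createWorker: () => Socket) {
  private val idle = mutable.Queue[Socket]()

  // A real implementation might also need to probe the worker process itself,
  // since a remotely closed peer does not flip isConnected on this side.
  private def isUsable(w: Socket): Boolean = !w.isClosed && w.isConnected

  def borrow(): Socket = {
    while (idle.nonEmpty) {
      val w = idle.dequeue()
      if (isUsable(w)) return w
      w.close() // drop dead workers silently
    }
    createWorker() // no healthy idle worker: fork a fresh one
  }

  def release(w: Socket): Unit = idle.enqueue(w)
}
{code}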



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47564) always throw FAILED_READ_FILE error when fail to read files

2024-03-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47564:
---
Labels: pull-request-available  (was: )

> always throw FAILED_READ_FILE error when fail to read files
> ---
>
> Key: SPARK-47564
> URL: https://issues.apache.org/jira/browse/SPARK-47564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


