[jira] [Updated] (SPARK-47612) Improve picking the side of partially clustered distribution according to partition size
[ https://issues.apache.org/jira/browse/SPARK-47612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated SPARK-47612:
Description:
SPJ currently relies on a simple heuristic when picking the side of a partially clustered distribution: it always picks the side with the smaller data size, based on table statistics, as the fully clustered side, even though that side could also contain skewed partitions. We can potentially do a fine-grained comparison based on partition values, since we now have that information.

was:
Now we pick the side of the partially clustered distribution using plan statistics to determine which side of the join to fully cluster partition values. We can optimize this to use partition sizes, since we now have that information.

> Improve picking the side of partially clustered distribution according to
> partition size
>
> Key: SPARK-47612
> URL: https://issues.apache.org/jira/browse/SPARK-47612
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Qi Zhu
> Priority: Major
>
> SPJ currently relies on a simple heuristic and always picks the side with the
> smaller data size based on table statistics as the fully clustered side, even
> though it could also contain skewed partitions.
> We can potentially do a fine-grained comparison based on partition values,
> since we now have the information.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
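The difference between the total-size heuristic and the proposed fine-grained comparison can be sketched roughly as follows. This is a toy Python illustration, not Spark's actual code; the per-partition-value vote is just one possible way to use the partition-size information:

```python
# Toy illustration (not Spark's implementation) of why a per-partition-value
# comparison can pick a better fully clustered side than total table size.

def pick_side_by_total_size(left_sizes, right_sizes):
    """Current heuristic: fully cluster the side with the smaller total size."""
    return "left" if sum(left_sizes.values()) <= sum(right_sizes.values()) else "right"

def pick_side_per_partition_value(left_sizes, right_sizes):
    """One possible refinement: compare sizes per partition value, so a single
    skewed partition no longer decides the outcome for every value."""
    votes = {"left": 0, "right": 0}
    for key in left_sizes.keys() | right_sizes.keys():
        smaller = "left" if left_sizes.get(key, 0) <= right_sizes.get(key, 0) else "right"
        votes[smaller] += 1
    return "left" if votes["left"] >= votes["right"] else "right"

# Left is smaller in total only because of one skewed value on the right;
# per partition value, the right side is smaller for most values.
left = {1: 200, 2: 5, 3: 200}    # total 405
right = {1: 20, 2: 400, 3: 20}   # total 440

print(pick_side_by_total_size(left, right))        # left
print(pick_side_per_partition_value(left, right))  # right
```

The point is that table-level statistics hide the skew on partition value 2; once per-partition sizes are available, the comparison can be made value by value.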
[jira] [Created] (SPARK-47612) Improve picking the side of partially clustered distribution according to partition size
Qi Zhu created SPARK-47612:
Summary: Improve picking the side of partially clustered distribution according to partition size
Key: SPARK-47612
URL: https://issues.apache.org/jira/browse/SPARK-47612
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.0.0
Reporter: Qi Zhu

Now we pick the side of the partially clustered distribution using plan statistics to determine which side of the join to fully cluster partition values. We can optimize this to use partition sizes, since we now have that information.
[jira] [Updated] (SPARK-47611) Cleanup dead code in MySQLDialect.getCatalystType
[ https://issues.apache.org/jira/browse/SPARK-47611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47611:
Labels: pull-request-available (was: )

> Cleanup dead code in MySQLDialect.getCatalystType
>
> Key: SPARK-47611
> URL: https://issues.apache.org/jira/browse/SPARK-47611
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-47611) Cleanup dead code in MySQLDialect.getCatalystType
Kent Yao created SPARK-47611:
Summary: Cleanup dead code in MySQLDialect.getCatalystType
Key: SPARK-47611
URL: https://issues.apache.org/jira/browse/SPARK-47611
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao
[jira] [Resolved] (SPARK-47610) Always set io.netty.tryReflectionSetAccessible=true
[ https://issues.apache.org/jira/browse/SPARK-47610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-47610.
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 45733
[https://github.com/apache/spark/pull/45733]

> Always set io.netty.tryReflectionSetAccessible=true
>
> Key: SPARK-47610
> URL: https://issues.apache.org/jira/browse/SPARK-47610
> Project: Spark
> Issue Type: Improvement
> Components: Build, Spark Core
> Affects Versions: 4.0.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-47610) Always set io.netty.tryReflectionSetAccessible=true
[ https://issues.apache.org/jira/browse/SPARK-47610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-47610:
Assignee: Cheng Pan

> Always set io.netty.tryReflectionSetAccessible=true
>
> Key: SPARK-47610
> URL: https://issues.apache.org/jira/browse/SPARK-47610
> Project: Spark
> Issue Type: Improvement
> Components: Build, Spark Core
> Affects Versions: 4.0.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Labels: pull-request-available
[jira] [Resolved] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size
[ https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-42040.
Fix Version/s: 4.0.0
Resolution: Fixed

> SPJ: Introduce a new API for V2 input partition to report partition size
>
> Key: SPARK-42040
> URL: https://issues.apache.org/jira/browse/SPARK-42040
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Chao Sun
> Assignee: Qi Zhu
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> It's useful for an {{InputPartition}} to also report its size (in bytes), so
> that Spark can use the info to decide whether partition grouping should be
> applied or not.
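How a per-partition size report could feed the grouping decision can be sketched as follows. This is a hypothetical Python sketch; the function name and the skew factor are illustrative assumptions, not Spark's actual API:

```python
# Hypothetical sketch: use per-partition sizes reported by V2 input partitions
# to decide whether grouping by partition value is safe. The function name and
# SKEW_FACTOR are assumptions for illustration only.

SKEW_FACTOR = 2  # tolerate partitions up to 2x the average size

def should_group_partitions(sizes_in_bytes):
    """Group partitions only when their sizes are reasonably balanced;
    a partition far above the average suggests skew, where grouping
    would concentrate too much data in one task."""
    if not sizes_in_bytes:
        return False
    avg = sum(sizes_in_bytes) / len(sizes_in_bytes)
    return max(sizes_in_bytes) <= SKEW_FACTOR * avg

print(should_group_partitions([100, 120, 90, 110]))   # True  (balanced)
print(should_group_partitions([100, 120, 90, 5000]))  # False (skewed)
```

Without the reported sizes, a decision like this would have to fall back on coarser table-level statistics.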
[jira] [Assigned] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size
[ https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-42040:
Assignee: Qi Zhu (was: zhuqi)

> SPJ: Introduce a new API for V2 input partition to report partition size
>
> Key: SPARK-42040
> URL: https://issues.apache.org/jira/browse/SPARK-42040
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Chao Sun
> Assignee: Qi Zhu
> Priority: Major
> Labels: pull-request-available
>
> It's useful for an {{InputPartition}} to also report its size (in bytes), so
> that Spark can use the info to decide whether partition grouping should be
> applied or not.
[jira] [Assigned] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size
[ https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-42040:
Assignee: zhuqi

> SPJ: Introduce a new API for V2 input partition to report partition size
>
> Key: SPARK-42040
> URL: https://issues.apache.org/jira/browse/SPARK-42040
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Chao Sun
> Assignee: zhuqi
> Priority: Major
> Labels: pull-request-available
>
> It's useful for an {{InputPartition}} to also report its size (in bytes), so
> that Spark can use the info to decide whether partition grouping should be
> applied or not.
[jira] [Resolved] (SPARK-47562) Factor literal handling out of `plan.py`
[ https://issues.apache.org/jira/browse/SPARK-47562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-47562.
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 45719
[https://github.com/apache/spark/pull/45719]

> Factor literal handling out of `plan.py`
>
> Key: SPARK-47562
> URL: https://issues.apache.org/jira/browse/SPARK-47562
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-47562) Factor literal handling out of `plan.py`
[ https://issues.apache.org/jira/browse/SPARK-47562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-47562:
Assignee: Ruifeng Zheng

> Factor literal handling out of `plan.py`
>
> Key: SPARK-47562
> URL: https://issues.apache.org/jira/browse/SPARK-47562
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
[jira] [Assigned] (SPARK-47570) Integrate range scan encoder changes with timer implementation
[ https://issues.apache.org/jira/browse/SPARK-47570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-47570:
Assignee: Jing Zhan

> Integrate range scan encoder changes with timer implementation
>
> Key: SPARK-47570
> URL: https://issues.apache.org/jira/browse/SPARK-47570
> Project: Spark
> Issue Type: Task
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Jing Zhan
> Assignee: Jing Zhan
> Priority: Major
> Labels: pull-request-available
[jira] [Resolved] (SPARK-47570) Integrate range scan encoder changes with timer implementation
[ https://issues.apache.org/jira/browse/SPARK-47570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-47570.
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 45709
[https://github.com/apache/spark/pull/45709]

> Integrate range scan encoder changes with timer implementation
>
> Key: SPARK-47570
> URL: https://issues.apache.org/jira/browse/SPARK-47570
> Project: Spark
> Issue Type: Task
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Jing Zhan
> Assignee: Jing Zhan
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Resolved] (SPARK-47273) Implement python stream writer interface
[ https://issues.apache.org/jira/browse/SPARK-47273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-47273.
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 45305
[https://github.com/apache/spark/pull/45305]

> Implement python stream writer interface
>
> Key: SPARK-47273
> URL: https://issues.apache.org/jira/browse/SPARK-47273
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SS
> Affects Versions: 4.0.0
> Reporter: Chaoqin Li
> Assignee: Chaoqin Li
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> To support developing Spark streaming sinks in Python, we need to implement a
> Python stream writer interface.
> Reuse PythonPartitionWriter to implement the serialization and execution of
> the write callback on executors.
> Implement a Python worker process to run the Python streaming data sink
> committer and communicate with the JVM through a socket on the Spark driver.
> For each Python streaming data sink instance there will be a long-lived
> Python worker process created. Inside the Python process, the Python write
> committer will receive abort or commit function calls and send the result
> back through the socket.
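The commit/abort round trip described above can be sketched, stripped of the actual socket protocol, as a plain dispatch loop. Class and message names here are hypothetical illustrations; the real worker speaks Spark's internal wire format:

```python
# Minimal sketch of the long-lived worker's dispatch: the JVM side sends a
# commit or abort request for a batch, the Python committer handles it, and
# the result would be written back over the socket. Names are illustrative.

class StreamingSinkCommitter:
    """Stand-in for a user-defined Python streaming data sink committer."""

    def __init__(self):
        self.log = []  # record of handled requests, for inspection

    def commit(self, batch_id, messages):
        self.log.append(("commit", batch_id, len(messages)))
        return "committed"

    def abort(self, batch_id, messages):
        self.log.append(("abort", batch_id, len(messages)))
        return "aborted"

def handle_request(committer, request):
    """Dispatch one (op, batch_id, messages) request; in the real worker this
    is read from, and the result written back to, the JVM socket."""
    op, batch_id, messages = request
    if op == "commit":
        return committer.commit(batch_id, messages)
    if op == "abort":
        return committer.abort(batch_id, messages)
    raise ValueError(f"unknown operation: {op}")

committer = StreamingSinkCommitter()
print(handle_request(committer, ("commit", 0, ["w1", "w2"])))  # committed
print(handle_request(committer, ("abort", 1, [])))             # aborted
```

Keeping the worker process alive across batches is what makes the committer stateful enough to see both the commit and abort sides of a batch.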
[jira] [Assigned] (SPARK-47273) Implement python stream writer interface
[ https://issues.apache.org/jira/browse/SPARK-47273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-47273:
Assignee: Chaoqin Li

> Implement python stream writer interface
>
> Key: SPARK-47273
> URL: https://issues.apache.org/jira/browse/SPARK-47273
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SS
> Affects Versions: 4.0.0
> Reporter: Chaoqin Li
> Assignee: Chaoqin Li
> Priority: Major
> Labels: pull-request-available
>
> To support developing Spark streaming sinks in Python, we need to implement a
> Python stream writer interface.
> Reuse PythonPartitionWriter to implement the serialization and execution of
> the write callback on executors.
> Implement a Python worker process to run the Python streaming data sink
> committer and communicate with the JVM through a socket on the Spark driver.
> For each Python streaming data sink instance there will be a long-lived
> Python worker process created. Inside the Python process, the Python write
> committer will receive abort or commit function calls and send the result
> back through the socket.
[jira] [Assigned] (SPARK-47498) Refine some fractional GPU resource calculation tests.
[ https://issues.apache.org/jira/browse/SPARK-47498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wu Yi reassigned SPARK-47498:
Assignee: Bobby Wang

> Refine some fractional GPU resource calculation tests.
>
> Key: SPARK-47498
> URL: https://issues.apache.org/jira/browse/SPARK-47498
> Project: Spark
> Issue Type: Improvement
> Components: Tests
> Affects Versions: 4.0.0
> Reporter: Bobby Wang
> Assignee: Bobby Wang
> Priority: Trivial
> Labels: pull-request-available
[jira] [Resolved] (SPARK-47498) Refine some fractional GPU resource calculation tests.
[ https://issues.apache.org/jira/browse/SPARK-47498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wu Yi resolved SPARK-47498.
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 45631
[https://github.com/apache/spark/pull/45631]

> Refine some fractional GPU resource calculation tests.
>
> Key: SPARK-47498
> URL: https://issues.apache.org/jira/browse/SPARK-47498
> Project: Spark
> Issue Type: Improvement
> Components: Tests
> Affects Versions: 4.0.0
> Reporter: Bobby Wang
> Assignee: Bobby Wang
> Priority: Trivial
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-47610) Always set io.netty.tryReflectionSetAccessible=true
[ https://issues.apache.org/jira/browse/SPARK-47610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47610:
Labels: pull-request-available (was: )

> Always set io.netty.tryReflectionSetAccessible=true
>
> Key: SPARK-47610
> URL: https://issues.apache.org/jira/browse/SPARK-47610
> Project: Spark
> Issue Type: Improvement
> Components: Build, Spark Core
> Affects Versions: 4.0.0
> Reporter: Cheng Pan
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-47610) Always set io.netty.tryReflectionSetAccessible=true
Cheng Pan created SPARK-47610:
Summary: Always set io.netty.tryReflectionSetAccessible=true
Key: SPARK-47610
URL: https://issues.apache.org/jira/browse/SPARK-47610
Project: Spark
Issue Type: Improvement
Components: Build, Spark Core
Affects Versions: 4.0.0
Reporter: Cheng Pan
[jira] [Updated] (SPARK-47609) CacheManager Lookup can miss picking InMemoryRelation corresponding to subplan
[ https://issues.apache.org/jira/browse/SPARK-47609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47609:
Description:
This issue became apparent while bringing my PR [https://github.com/apache/spark/pull/43854] in sync with the latest master. That PR is meant to collapse projects early, in the analyzer phase itself, so that the tree size is kept at a minimum as projects keep getting added; but as part of that work, the CacheManager lookup also needed to be modified. One of the newly added tests in master failed. Analysis of the failure shows that the CacheManager is not picking the cached InMemoryRelation for a subplan. This shows up in the following existing test in org.apache.spark.sql.DatasetCacheSuite:

{quote}
test("SPARK-26708 Cache data and cached plan should stay consistent") {
  val df = spark.range(0, 5).toDF("a")
  val df1 = df.withColumn("b", $"a" + 1)
  val df2 = df.filter($"a" > 1)
  df.cache()
  // Add df1 to the CacheManager; the buffer is currently empty.
  df1.cache()
  // After calling collect(), df1's buffer has been loaded.
  df1.collect()
  // Add df2 to the CacheManager; the buffer is currently empty.
  df2.cache()
  // Verify that df1 is an InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df1)
  val df1InnerPlan = df1.queryExecution.withCachedData
    .asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
  // Verify that df2 is an InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df2)
  df.unpersist(blocking = true)
  // Verify that df1's cache has stayed the same, since df1's cache already has data
  // before df.unpersist().
  val df1Limit = df1.limit(2)
  val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)
  // Verify that df2's cache has been re-cached, with a new physical plan rid of dependency
  // on df, since df2's cache had not been loaded before df.unpersist().
  val df2Limit = df2.limit(2)
  val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  // This assertion is not right
  assert(df2LimitInnerPlan.isDefined &&
    !df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{quote}

Since df1 exists in the cache as an InMemoryRelation, and given

  val df = spark.range(0, 5).toDF("a")
  val df1 = df.withColumn("b", $"a" + 1)
  val df2 = df.filter($"a" > 1)

df2 is derivable from the cached df1. So when val df2Limit = df2.limit(2) is created, it should utilize the cached df1.
[jira] [Created] (SPARK-47609) CacheManager Lookup can miss picking InMemoryRelation corresponding to subplan
Asif created SPARK-47609:
Summary: CacheManager Lookup can miss picking InMemoryRelation corresponding to subplan
Key: SPARK-47609
URL: https://issues.apache.org/jira/browse/SPARK-47609
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.1
Reporter: Asif

This issue became apparent while bringing my PR [https://github.com/apache/spark/pull/43854] in sync with the latest master. That PR is meant to collapse projects early, in the analyzer phase itself, so that the tree size is kept at a minimum as projects keep getting added; but as part of that work, the CacheManager lookup also needed to be modified. One of the newly added tests in master failed. Analysis of the failure shows that the CacheManager is not picking the cached InMemoryRelation for a subplan. This shows up in the following existing test:

{quote}
test("SPARK-26708 Cache data and cached plan should stay consistent") {
  val df = spark.range(0, 5).toDF("a")
  val df1 = df.withColumn("b", $"a" + 1)
  val df2 = df.filter($"a" > 1)
  df.cache()
  // Add df1 to the CacheManager; the buffer is currently empty.
  df1.cache()
  // After calling collect(), df1's buffer has been loaded.
  df1.collect()
  // Add df2 to the CacheManager; the buffer is currently empty.
  df2.cache()
  // Verify that df1 is an InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df1)
  val df1InnerPlan = df1.queryExecution.withCachedData
    .asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
  // Verify that df2 is an InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df2)
  df.unpersist(blocking = true)
  // Verify that df1's cache has stayed the same, since df1's cache already has data
  // before df.unpersist().
  val df1Limit = df1.limit(2)
  val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)
  // Verify that df2's cache has been re-cached, with a new physical plan rid of dependency
  // on df, since df2's cache had not been loaded before df.unpersist().
  val df2Limit = df2.limit(2)
  val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  // This assertion is not right
  assert(df2LimitInnerPlan.isDefined &&
    !df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{quote}

Since df1 exists in the cache as an InMemoryRelation, and given

  val df = spark.range(0, 5).toDF("a")
  val df1 = df.withColumn("b", $"a" + 1)
  val df2 = df.filter($"a" > 1)

df2 is derivable from the cached df1. So when val df2Limit = df2.limit(2) is created, it should utilize the cached df1.
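The "derivable from the cached df1" argument amounts to a coverage check. A conceptual Python illustration (not the CacheManager's actual canonicalized-plan comparison; the dict-based plan representation is purely an assumption for the sketch):

```python
# Conceptual illustration of the lookup the ticket argues for: a cached plan
# can serve a scan if it reads the same source and its output columns are a
# superset of what the scan needs. The real CacheManager instead compares
# canonicalized logical plans; this is a deliberate simplification.

def covers(cached, needed):
    """True if the cached entry can substitute for the needed scan."""
    return (cached["source"] == needed["source"]
            and set(needed["columns"]) <= set(cached["columns"]))

# df1 = df.withColumn("b", $"a" + 1) is cached and materialized.
df1_cached = {"source": "range(0, 5)", "columns": ["a", "b"]}

# df2 = df.filter($"a" > 1) only needs column 'a' from the same source,
# so the cached df1 could serve it (the filter runs on top).
df2_scan = {"source": "range(0, 5)", "columns": ["a"]}

# A scan needing a column df1 does not produce is not covered.
other_scan = {"source": "range(0, 5)", "columns": ["a", "c"]}

print(covers(df1_cached, df2_scan))    # True
print(covers(df1_cached, other_scan))  # False
```

Under this view, df2's scan should be rewritten onto the cached df1 with the filter applied on top, instead of re-reading the source.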
[jira] [Comment Edited] (SPARK-26708) Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan
[ https://issues.apache.org/jira/browse/SPARK-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831116#comment-17831116 ] Asif edited comment on SPARK-26708 at 3/27/24 12:58 AM:
I believe the current caching logic is suboptimal, and accordingly the bug test for it is testing a suboptimal approach. The bug test for this is:

{quote}
test("SPARK-26708 Cache data and cached plan should stay consistent") {
  val df = spark.range(0, 5).toDF("a")
  val df1 = df.withColumn("b", $"a" + 1)
  val df2 = df.filter($"a" > 1)
  df.cache()
  // Add df1 to the CacheManager; the buffer is currently empty.
  df1.cache()
  // After calling collect(), df1's buffer has been loaded.
  df1.collect()
  // Add df2 to the CacheManager; the buffer is currently empty.
  df2.cache()
  // Verify that df1 is an InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df1)
  val df1InnerPlan = df1.queryExecution.withCachedData
    .asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
  // Verify that df2 is an InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df2)
  df.unpersist(blocking = true)
  // Verify that df1's cache has stayed the same, since df1's cache already has data
  // before df.unpersist().
  val df1Limit = df1.limit(2)
  val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)
  // Verify that df2's cache has been re-cached, with a new physical plan rid of dependency
  // on df, since df2's cache had not been loaded before df.unpersist().
  val df2Limit = df2.limit(2)
  val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df2LimitInnerPlan.isDefined &&
    !df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{quote}

Optimal caching should have resulted in df2LimitInnerPlan actually containing an InMemoryTableScanExec corresponding to df1. The reason is that since df1 was already materialized, it rightly exists in the cache. And df2 is derivable from the cached df1 (it just has an extra projection, but the cached df1 can otherwise serve df2).

was (Author: ashahid7):
I believe the current caching logic is suboptimal, and accordingly the bug test for it is testing a suboptimal approach. The bug test is the same one quoted above. The optimal caching should have resulted in df2LimitInnerPlan actually containing an InMemoryTableScanExec corresponding to df1. The reason being that since df2 was already materialized, it exists in the cache rightly. And df2 is derivable from the cached df1 (it just has an extra projection but otherwise can serve df2).

> Incorrect result caused by inconsistency between a SQL cache's cached RDD and
> its physical plan
>
> Key: SPARK-26708
> URL: https://issues.apache.org/jira/browse/SPARK-26708
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Xiao Li
> Assignee: Wei Xue
> Priority: Blocker
> Labels: correctness
> Fix For: 2.4.1, 3.0.0
[jira] [Comment Edited] (SPARK-26708) Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan
[ https://issues.apache.org/jira/browse/SPARK-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831117#comment-17831117 ] Asif edited comment on SPARK-26708 at 3/27/24 12:54 AM:
Towards that, please take a look at the ticket and PR: https://issues.apache.org/jira/browse/SPARK-45959 and the PR associated with it. Though that PR primarily deals with aggressive collapse of projects at the end of analysis, as part of the fix it uses an enhanced cached-plan lookup and thus results in the above behaviour.

was (Author: ashahid7):
Towards that please take a look at ticket & PR: [https://issues.apache.org/jira/browse/SPARK-45959]

> Incorrect result caused by inconsistency between a SQL cache's cached RDD and
> its physical plan
>
> Key: SPARK-26708
> URL: https://issues.apache.org/jira/browse/SPARK-26708
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Xiao Li
> Assignee: Wei Xue
> Priority: Blocker
> Labels: correctness
> Fix For: 2.4.1, 3.0.0
>
> When performing non-cascading cache invalidation, {{recache}} is called on
> the other cache entries which are dependent on the cache being invalidated.
> This leads to the physical plans of those cache entries being re-compiled.
> For those cache entries, if the cached RDD has already been persisted,
> chances are there will be inconsistency between the data and the new plan.
> It can cause a correctness issue if the new plan's {{outputPartitioning}} or
> {{outputOrdering}} is different from that of the actual data, and meanwhile
> the cache is used by another query that asks for a specific
> {{outputPartitioning}} or {{outputOrdering}} which happens to match the new
> plan but not the actual data.
[jira] [Commented] (SPARK-26708) Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan
[ https://issues.apache.org/jira/browse/SPARK-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831116#comment-17831116 ] Asif commented on SPARK-26708: -- I believe the current caching logic is suboptimal, and accordingly the bug's test exercises a suboptimal approach. The test is:
{quote}
test("SPARK-26708 Cache data and cached plan should stay consistent") {
  val df = spark.range(0, 5).toDF("a")
  val df1 = df.withColumn("b", $"a" + 1)
  val df2 = df.filter($"a" > 1)
  df.cache()
  // Add df1 to the CacheManager; the buffer is currently empty.
  df1.cache()
  // After calling collect(), df1's buffer has been loaded.
  df1.collect()
  // Add df2 to the CacheManager; the buffer is currently empty.
  df2.cache()
  // Verify that df1 is a InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df1)
  val df1InnerPlan = df1.queryExecution.withCachedData
    .asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
  // Verify that df2 is a InMemoryRelation plan with dependency on another cached plan.
  assertCacheDependency(df2)
  df.unpersist(blocking = true)
  // Verify that df1's cache has stayed the same, since df1's cache already has data
  // before df.unpersist().
  val df1Limit = df1.limit(2)
  val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)
  // Verify that df2's cache has been re-cached, with a new physical plan rid of dependency
  // on df, since df2's cache had not been loaded before df.unpersist().
  val df2Limit = df2.limit(2)
  val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df2LimitInnerPlan.isDefined &&
    !df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{quote}
Optimal caching should have resulted in df2LimitInnerPlan actually containing an InMemoryTableScanExec corresponding to df1. The reason: since df1 was already materialized, it rightly remains in the cache, and df2 is derivable from the cached df1 (df1 just has an extra projection but can otherwise serve df2). > Incorrect result caused by inconsistency between a SQL cache's cached RDD and > its physical plan > --- > > Key: SPARK-26708 > URL: https://issues.apache.org/jira/browse/SPARK-26708 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Wei Xue >Priority: Blocker > Labels: correctness > Fix For: 2.4.1, 3.0.0 > > > When performing non-cascading cache invalidation, {{recache}} is called on > the other cache entries which are dependent on the cache being invalidated. > It leads to the physical plans of those cache entries being re-compiled. > For those cache entries, if the cache RDD has already been persisted, chances > are there will be inconsistency between the data and the new plan. It can > cause a correctness issue if the new plan's {{outputPartitioning}} or > {{outputOrdering}} is different from that of the actual data, and > meanwhile the cache is used by another query that asks for specific > {{outputPartitioning}} or {{outputOrdering}} which happens to match the new > plan but not the actual data. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
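Asif's point above, that a plan which misses an exact cache match may still be served by an already materialized entry plus an extra projection, can be sketched as a toy lookup in plain Python. Everything below (CacheEntry, lookup) is a hypothetical model for illustration, not Spark's CacheManager API:

```python
# Toy model of a derivability-aware cache lookup: a request that misses an
# exact match can still be served by a materialized entry whose columns are
# a superset of those requested (just add a projection/filter on top).
from dataclasses import dataclass


@dataclass
class CacheEntry:
    name: str
    columns: frozenset   # columns the cached data contains
    materialized: bool   # has the buffer actually been loaded?


def lookup(entries, requested_columns):
    """Return a materialized entry able to serve the request, if any."""
    for e in entries:
        if e.materialized and requested_columns <= e.columns:
            return e     # derivable from this entry
    return None


# df1 = df.withColumn("b", ...) was collected, so it is materialized;
# df2 = df.filter(...) ultimately needs only column "a".
df1_entry = cached = CacheEntry("df1", frozenset({"a", "b"}), materialized=True)
hit = lookup([df1_entry], frozenset({"a"}))
```

Under this model `hit` is the df1 entry, mirroring the argument that df2's plan could have been answered from df1's cache rather than recomputed.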
[jira] [Updated] (SPARK-47561) fix analyzer rule order issues about Alias
[ https://issues.apache.org/jira/browse/SPARK-47561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-47561: Fix Version/s: 3.5.2 > fix analyzer rule order issues about Alias > -- > > Key: SPARK-47561 > URL: https://issues.apache.org/jira/browse/SPARK-47561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47558) [Arbitrary State Support] State TTL support - ValueState
[ https://issues.apache.org/jira/browse/SPARK-47558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47558: --- Labels: pull-request-available (was: ) > [Arbitrary State Support] State TTL support - ValueState > > > Key: SPARK-47558 > URL: https://issues.apache.org/jira/browse/SPARK-47558 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Bhuwan Sahni >Priority: Major > Labels: pull-request-available > > Add support for expiring state value based on ttl for Value State in > transformWithState operator. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47608) Improve user experience of loading logs as json data source
Gengliang Wang created SPARK-47608: -- Summary: Improve user experience of loading logs as json data source Key: SPARK-47608 URL: https://issues.apache.org/jira/browse/SPARK-47608 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang E.g., create a constant table schema in object Logging so that users can query the JSON log files easily. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47607) Add documentation for Structured logging framework
Gengliang Wang created SPARK-47607: -- Summary: Add documentation for Structured logging framework Key: SPARK-47607 URL: https://issues.apache.org/jira/browse/SPARK-47607 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47606) Create log4j templates for both structured logging and plain text logging
Gengliang Wang created SPARK-47606: -- Summary: Create log4j templates for both structured logging and plain text logging Key: SPARK-47606 URL: https://issues.apache.org/jira/browse/SPARK-47606 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47605) Enable structured logging in all the test log4j2.properties
Gengliang Wang created SPARK-47605: -- Summary: Enable structured logging in all the test log4j2.properties Key: SPARK-47605 URL: https://issues.apache.org/jira/browse/SPARK-47605 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47604) Resource managers: Migrate logInfo with variables to structured logging framework
Gengliang Wang created SPARK-47604: -- Summary: Resource managers: Migrate logInfo with variables to structured logging framework Key: SPARK-47604 URL: https://issues.apache.org/jira/browse/SPARK-47604 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47600) MLLib: Migrate logInfo with variables to structured logging framework
Gengliang Wang created SPARK-47600: -- Summary: MLLib: Migrate logInfo with variables to structured logging framework Key: SPARK-47600 URL: https://issues.apache.org/jira/browse/SPARK-47600 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47601) Graphx: Migrate logs with variables to structured logging framework
Gengliang Wang created SPARK-47601: -- Summary: Graphx: Migrate logs with variables to structured logging framework Key: SPARK-47601 URL: https://issues.apache.org/jira/browse/SPARK-47601 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47598) MLLib: Migrate logError with variables to structured logging framework
Gengliang Wang created SPARK-47598: -- Summary: MLLib: Migrate logError with variables to structured logging framework Key: SPARK-47598 URL: https://issues.apache.org/jira/browse/SPARK-47598 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47599) MLLib: Migrate logWarn with variables to structured logging framework
Gengliang Wang created SPARK-47599: -- Summary: MLLib: Migrate logWarn with variables to structured logging framework Key: SPARK-47599 URL: https://issues.apache.org/jira/browse/SPARK-47599 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47594) Connector module: Migrate logInfo with variables to structured logging framework
Gengliang Wang created SPARK-47594: -- Summary: Connector module: Migrate logInfo with variables to structured logging framework Key: SPARK-47594 URL: https://issues.apache.org/jira/browse/SPARK-47594 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47592) Connector module: Migrate logError with variables to structured logging framework
Gengliang Wang created SPARK-47592: -- Summary: Connector module: Migrate logError with variables to structured logging framework Key: SPARK-47592 URL: https://issues.apache.org/jira/browse/SPARK-47592 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47593) Connector module: Migrate logWarn with variables to structured logging framework
Gengliang Wang created SPARK-47593: -- Summary: Connector module: Migrate logWarn with variables to structured logging framework Key: SPARK-47593 URL: https://issues.apache.org/jira/browse/SPARK-47593 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47590) Hive-thriftserver: Migrate logWarn with variables to structured logging framework
Gengliang Wang created SPARK-47590: -- Summary: Hive-thriftserver: Migrate logWarn with variables to structured logging framework Key: SPARK-47590 URL: https://issues.apache.org/jira/browse/SPARK-47590 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47591) Hive-thriftserver: Migrate logInfo with variables to structured logging framework
Gengliang Wang created SPARK-47591: -- Summary: Hive-thriftserver: Migrate logInfo with variables to structured logging framework Key: SPARK-47591 URL: https://issues.apache.org/jira/browse/SPARK-47591 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47588) Hive module: Migrate logInfo with variables to structured logging framework
Gengliang Wang created SPARK-47588: -- Summary: Hive module: Migrate logInfo with variables to structured logging framework Key: SPARK-47588 URL: https://issues.apache.org/jira/browse/SPARK-47588 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47583) SQL core: Migrate logError with variables to structured logging framework
Gengliang Wang created SPARK-47583: -- Summary: SQL core: Migrate logError with variables to structured logging framework Key: SPARK-47583 URL: https://issues.apache.org/jira/browse/SPARK-47583 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47589) Hive-thriftserver: Migrate logError with variables to structured logging framework
Gengliang Wang created SPARK-47589: -- Summary: Hive-thriftserver: Migrate logError with variables to structured logging framework Key: SPARK-47589 URL: https://issues.apache.org/jira/browse/SPARK-47589 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47586) Hive module: Migrate logError with variables to structured logging framework
Gengliang Wang created SPARK-47586: -- Summary: Hive module: Migrate logError with variables to structured logging framework Key: SPARK-47586 URL: https://issues.apache.org/jira/browse/SPARK-47586 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47587) Hive module: Migrate logWarn with variables to structured logging framework
Gengliang Wang created SPARK-47587: -- Summary: Hive module: Migrate logWarn with variables to structured logging framework Key: SPARK-47587 URL: https://issues.apache.org/jira/browse/SPARK-47587 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47585) SQL core: Migrate logInfo with variables to structured logging framework
Gengliang Wang created SPARK-47585: -- Summary: SQL core: Migrate logInfo with variables to structured logging framework Key: SPARK-47585 URL: https://issues.apache.org/jira/browse/SPARK-47585 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47584) SQL core: Migrate logWarn with variables to structured logging framework
Gengliang Wang created SPARK-47584: -- Summary: SQL core: Migrate logWarn with variables to structured logging framework Key: SPARK-47584 URL: https://issues.apache.org/jira/browse/SPARK-47584 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47582) SQL catalyst: Migrate logInfo with variables to structured logging framework
Gengliang Wang created SPARK-47582: -- Summary: SQL catalyst: Migrate logInfo with variables to structured logging framework Key: SPARK-47582 URL: https://issues.apache.org/jira/browse/SPARK-47582 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47581) SQL catalyst: Migrate logWarn with variables to structured logging framework
Gengliang Wang created SPARK-47581: -- Summary: SQL catalyst: Migrate logWarn with variables to structured logging framework Key: SPARK-47581 URL: https://issues.apache.org/jira/browse/SPARK-47581 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47579) Spark core: Migrate logInfo with variables to structured logging framework
Gengliang Wang created SPARK-47579: -- Summary: Spark core: Migrate logInfo with variables to structured logging framework Key: SPARK-47579 URL: https://issues.apache.org/jira/browse/SPARK-47579 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47580) SQL catalyst: Migrate logError with variables to structured logging framework
Gengliang Wang created SPARK-47580: -- Summary: SQL catalyst: Migrate logError with variables to structured logging framework Key: SPARK-47580 URL: https://issues.apache.org/jira/browse/SPARK-47580 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47577) Spark core: Migrate logError with variables to structured logging framework
Gengliang Wang created SPARK-47577: -- Summary: Spark core: Migrate logError with variables to structured logging framework Key: SPARK-47577 URL: https://issues.apache.org/jira/browse/SPARK-47577 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47578) Spark core: Migrate logWarn with variables to structured logging framework
Gengliang Wang created SPARK-47578: -- Summary: Spark core: Migrate logWarn with variables to structured logging framework Key: SPARK-47578 URL: https://issues.apache.org/jira/browse/SPARK-47578 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47575) Implement logWarn API in structured logging framework
Gengliang Wang created SPARK-47575: -- Summary: Implement logWarn API in structured logging framework Key: SPARK-47575 URL: https://issues.apache.org/jira/browse/SPARK-47575 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47576) Implement logInfo API in structured logging framework
Gengliang Wang created SPARK-47576: -- Summary: Implement logInfo API in structured logging framework Key: SPARK-47576 URL: https://issues.apache.org/jira/browse/SPARK-47576 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
[ https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sundeep K updated SPARK-47556: -- Description: h3. Issue: We noticed that K8s executor pods sometimes go into a crash loop with 'Error: MountVolume.SetUp failed for volume "spark-conf-volume-exec"'. Upon investigation we noticed that 2 Spark jobs had launched with the same application ID, and when one of them finished first it deleted all of its resources, including the resources of the other job. -> The Spark application ID is created using this [code|https://github.com/apache/spark/blob/36126a5c1821b4418afd5788963a939ea7f64078/core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala#L38]: "spark-application-" + System.currentTimeMillis. This means that if 2 applications launch in the same millisecond they can end up with the same app ID. -> The [spark-app-selector|https://github.com/apache/spark/blob/93f98c0a61ddb66eb777c3940fbf29fc58e2d79b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L23] label is added to every resource created by the driver, and its value is the application ID. The Kubernetes scheduler backend deletes all resources with the same [label|https://github.com/apache/spark/blob/2a8bb5cdd3a5a2d63428b82df5e5066a805ce878/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L162C1-L172C6] upon termination. This deletes the config map and executor pods of the job that is still running; the driver tries to relaunch the executor pods, but the config map is gone, so the job stays in a crash loop. h3. Context We are using [Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html] and launch our Spark jobs using PySpark. We launch multiple Spark jobs within a given k8s namespace. Each Spark job can be launched from different pods or from different processes in a pod. Every time a job is launched it has a unique app name. 
Here is how the job is launched (omitting irrelevant details):
{code:java}
# spark_conf has settings required for spark on k8s
sp = SparkSession.builder \
    .config(conf=spark_conf) \
    .appName('testapp')
sp.master(f'k8s://{kubernetes_host}')
session = sp.getOrCreate()
with session:
    session.sql('SELECT 1')
{code}
h3. Repro Set the same app ID in the Spark config, then run 2 different jobs, one that finishes fast and one that runs slow. The slower job goes into a crash loop.
{code:java} "spark.app.id": ""{code}
h3. Workaround Set a unique spark.app.id for all the jobs that run on k8s, e.g.:
{code:java} "spark.app.id": f'{AppName}-{CurrTimeInMilliSecs}-{UUId}'[:63]{code}
h3. Fix Add a unique hash at the end of the application ID: [https://github.com/apache/spark/pull/45712] was: h3. Issue: We noticed that sometimes K8s executor pods go in a crash loop. Reason being 'Error: MountVolume.SetUp failed for volume "spark-conf-volume-exec"'. Upon investigation we noticed that there are 2 spark jobs that launched with same application id and when one of them finishes first it deletes all it's resources and deletes the resources of other job too. -> Spark application ID is created using this [code|https://affirm.slack.com/archives/C06Q2GWLWKH/p1711132115304449?thread_ts=1711123500.783909&cid=C06Q2GWLWKH] "spark-application-" + System.currentTimeMillis This means if 2 applications launch at the same milli second they could end up having same AppId -> [spark-app-selector|https://github.com/apache/spark/blob/93f98c0a61ddb66eb777c3940fbf29fc58e2d79b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L23] label is added to all resource created by driver and it's value is application Id. 
Kubernetes Scheduler deletes all the apps with same [label|https://github.com/apache/spark/blob/2a8bb5cdd3a5a2d63428b82df5e5066a805ce878/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L162C1-L172C6] upon termination. This results in deletion of config map and executor pods of job that's still running, driver tries to relaunch the executor pods, but config map is not present, so it's in crash loop h3. Context We are using [Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html] and launch our spark jobs using PySpark. We launch multiple Spark Jobs within a given k8s namespace. Each Spark job can be launched from different pods or from different processes in a pod. Every time a job is launched it has a unique app name. Here is how the job is launched (omitting irrelevant details):
{code:java}
# spark_conf has settings required for spark on k8s
sp = SparkSession.builder \
    .config(conf=spark_conf) \
    .appName('testapp')
sp.master(f'k8s://{kubernetes_host}')
session = sp.getOrCreate()
with session:
    session.sql('SELECT 1')
{code}
h3. Repro Set
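The collision mechanism and the style of fix in the ticket above can be sketched in plain Python. The function names (app_id_legacy, app_id_unique) are illustrative, not Spark APIs: an ID derived only from the current millisecond can repeat across drivers, while appending a random suffix, as in the suggested workaround, avoids that; the [:63] cap mirrors the Kubernetes label-value length limit.

```python
# Sketch of the app-ID collision and a unique-suffix fix (illustrative names).
import uuid


def app_id_legacy(millis: int) -> str:
    # Mirrors "spark-application-" + System.currentTimeMillis:
    # two drivers starting in the same millisecond get the same ID.
    return f"spark-application-{millis}"


def app_id_unique(millis: int) -> str:
    # Workaround style: append a random suffix, then cap at 63 chars
    # (the Kubernetes label-value length limit).
    return f"spark-application-{millis}-{uuid.uuid4().hex}"[:63]


# Same millisecond: the legacy scheme collides, the suffixed one does not.
collision = app_id_legacy(1711500000000) == app_id_legacy(1711500000000)
distinct = app_id_unique(1711500000000) != app_id_unique(1711500000000)
```

Here `collision` is True and `distinct` is True, which is exactly why the spark-app-selector label derived from the legacy ID can select, and delete, another job's resources.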
[jira] [Updated] (SPARK-47572) Enforce Window partitionSpec is orderable.
[ https://issues.apache.org/jira/browse/SPARK-47572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47572: --- Labels: pull-request-available (was: ) > Enforce Window partitionSpec is orderable. > -- > > Key: SPARK-47572 > URL: https://issues.apache.org/jira/browse/SPARK-47572 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.1, 3.3.4 >Reporter: Chenhao Li >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47574) Introduce Structured Logging Framework
[ https://issues.apache.org/jira/browse/SPARK-47574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47574: --- Labels: pull-request-available (was: ) > Introduce Structured Logging Framework > -- > > Key: SPARK-47574 > URL: https://issues.apache.org/jira/browse/SPARK-47574 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: pull-request-available > > Introduce Structured Logging Framework as per > [https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing] > . > * The default logging output format will be json lines. For example > {code:java} > { > "ts":"2023-03-12T12:02:46.661-0700", > "level":"ERROR", > "msg":"Cannot determine whether executor 289 is alive or not", > "context":{ > "executor_id":"289" > }, > "exception":{ > "class":"org.apache.spark.SparkException", > "msg":"Exception thrown in awaitResult", > "stackTrace":"..." > }, > "source":"BlockManagerMasterEndpoint" > } {code} > * Introduce a new configuration `spark.log.structuredLogging.enabled` to > control the default log4j configuration. Users can set it as false to get > plain text log outputs > * The change will start with logError method. Example changes on the API: > from > `logError(s"Cannot determine whether executor $executorId is alive or not.", > e)` to `logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, > executorId)} is alive or not.", e)` > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47573) Support custom driver log URLs for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-47573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47573: --- Labels: pull-request-available (was: ) > Support custom driver log URLs for Kubernetes > - > > Key: SPARK-47573 > URL: https://issues.apache.org/jira/browse/SPARK-47573 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Enrico Minack >Priority: Major > Labels: pull-request-available > > Spark provides the option to set the URL for *executor* logs via > {{spark.ui.custom.executor.log.url}}. This should be possible for *driver* > logs as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47574) Introduce Structured Logging Framework
[ https://issues.apache.org/jira/browse/SPARK-47574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-47574: --- Description: Introduce Structured Logging Framework as per [https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing] . * The default logging output format will be json lines. For example {code:java} { "ts":"2023-03-12T12:02:46.661-0700", "level":"ERROR", "msg":"Cannot determine whether executor 289 is alive or not", "context":{ "executor_id":"289" }, "exception":{ "class":"org.apache.spark.SparkException", "msg":"Exception thrown in awaitResult", "stackTrace":"..." }, "source":"BlockManagerMasterEndpoint" } {code} * Introduce a new configuration `spark.log.structuredLogging.enabled` to control the default log4j configuration. Users can set it as false to get plain text log outputs * The change will start with logError method. Example changes on the API: from `logError(s"Cannot determine whether executor $executorId is alive or not.", e)` to `logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, executorId)} is alive or not.", e)` was: Introduce Structured Logging Framework as per [https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing] . * The default logging output format will be json lines. For example {code:java} { "ts":"2023-03-12T12:02:46.661-0700", "level":"ERROR", "msg":"Cannot determine whether executor 289 is alive or not", "context":{ "executor_id":"289" }, "exception":{ "class":"org.apache.spark.SparkException", "msg":"Exception thrown in awaitResult", "stackTrace":"..." }, "source":"BlockManagerMasterEndpoint" } {code} * Introduce a new configuration `spark.log.structuredLogging.enabled` to control the default log4j configuration. Users can set it as false to get plain text log outputs * The change will start with logError method. 
The Logging API will be changed from `logError(s"Cannot determine whether executor $executorId is alive or not.", e)` to `logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, executorId)} is alive or not.", e)` > Introduce Structured Logging Framework > -- > > Key: SPARK-47574 > URL: https://issues.apache.org/jira/browse/SPARK-47574 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Introduce Structured Logging Framework as per > [https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing] > . > * The default logging output format will be json lines. For example > {code:java} > { > "ts":"2023-03-12T12:02:46.661-0700", > "level":"ERROR", > "msg":"Cannot determine whether executor 289 is alive or not", > "context":{ > "executor_id":"289" > }, > "exception":{ > "class":"org.apache.spark.SparkException", > "msg":"Exception thrown in awaitResult", > "stackTrace":"..." > }, > "source":"BlockManagerMasterEndpoint" > } {code} > * Introduce a new configuration `spark.log.structuredLogging.enabled` to > control the default log4j configuration. Users can set it as false to get > plain text log outputs > * The change will start with logError method. Example changes on the API: > from > `logError(s"Cannot determine whether executor $executorId is alive or not.", > e)` to `logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, > executorId)} is alive or not.", e)` > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47574) Introduce Structured Logging Framework
Gengliang Wang created SPARK-47574: -- Summary: Introduce Structured Logging Framework Key: SPARK-47574 URL: https://issues.apache.org/jira/browse/SPARK-47574 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Introduce Structured Logging Framework as per [https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing] . * The default logging output format will be json lines. For example {code:java} { "ts":"2023-03-12T12:02:46.661-0700", "level":"ERROR", "msg":"Cannot determine whether executor 289 is alive or not", "context":{ "executor_id":"289" }, "exception":{ "class":"org.apache.spark.SparkException", "msg":"Exception thrown in awaitResult", "stackTrace":"..." }, "source":"BlockManagerMasterEndpoint" } {code} * Introduce a new configuration `spark.log.structuredLogging.enabled` to control the default log4j configuration. Users can set it as false to get plain text log outputs * The change will start with logError method. The Logging API will be changed from `logError(s"Cannot determine whether executor $executorId is alive or not.", e)` to `logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, executorId)} is alive or not.", e)` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
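[Editorial note] The JSON-lines format proposed above is easy to sketch outside of Spark. The following stand-alone Java sketch is hypothetical (the `StructuredLogger` class and its `logError` helper are illustrations, not Spark's actual logging API); it only shows how an MDC-style context map could be rendered into the proposed output shape:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the proposed JSON-lines log shape.
// StructuredLogger and its field layout are hypothetical; they only
// illustrate how an MDC-style context map could be rendered to JSON.
public class StructuredLogger {
    // Escape backslashes and quotes, then wrap in double quotes.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static String logError(String ts, String msg, Map<String, String> context, String source) {
        StringBuilder sb = new StringBuilder("{");
        sb.append(quote("ts")).append(":").append(quote(ts)).append(",");
        sb.append(quote("level")).append(":").append(quote("ERROR")).append(",");
        sb.append(quote("msg")).append(":").append(quote(msg)).append(",");
        sb.append(quote("context")).append(":{");
        boolean first = true;
        for (Map.Entry<String, String> e : context.entrySet()) {
            if (!first) sb.append(",");
            sb.append(quote(e.getKey())).append(":").append(quote(e.getValue()));
            first = false;
        }
        sb.append("},").append(quote("source")).append(":").append(quote(source)).append("}");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> ctx = new LinkedHashMap<>();
        ctx.put("executor_id", "289"); // what MDC(EXECUTOR_ID, executorId) would contribute
        System.out.println(logError("2023-03-12T12:02:46.661-0700",
                "Cannot determine whether executor 289 is alive or not",
                ctx, "BlockManagerMasterEndpoint"));
    }
}
```

Run as-is, this emits a single JSON line with `ts`, `level`, `msg`, `context`, and `source` fields, matching the shape of the example in the ticket.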
[jira] [Created] (SPARK-47572) Enforce Window partitionSpec is orderable.
Chenhao Li created SPARK-47572: -- Summary: Enforce Window partitionSpec is orderable. Key: SPARK-47572 URL: https://issues.apache.org/jira/browse/SPARK-47572 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.4, 3.5.1, 3.4.1 Reporter: Chenhao Li -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47571) date_format() java.lang.ArithmeticException: long overflow for large dates
Serge Rielau created SPARK-47571: Summary: date_format() java.lang.ArithmeticException: long overflow for large dates Key: SPARK-47571 URL: https://issues.apache.org/jira/browse/SPARK-47571 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: Serge Rielau The following works for CAST(), but not for DATE_FORMAT(): select cast(cast('5881580' AS DATE) AS STRING); +5881580-01-01 spark-sql (default)> select date_format(cast('5881580' AS DATE), 'yyy-mm-dd'); 24/03/26 11:08:23 ERROR SparkSQLDriver: Failed in [select date_format(cast('5881580' AS DATE), 'yyy-mm-dd')] java.lang.ArithmeticException: long overflow at java.base/java.lang.Math.multiplyExact(Math.java:1004) at org.apache.spark.sql.catalyst.util.SparkDateTimeUtils.instantToMicros(SparkDateTimeUtils.scala:122) at org.apache.spark.sql.catalyst.util.SparkDateTimeUtils.instantToMicros$(SparkDateTimeUtils.scala:116) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.instantToMicros(DateTimeUtils.scala:41) at org.apache.spark.sql.catalyst.util.SparkDateTimeUtils.daysToMicros(SparkDateTimeUtils.scala:174) at org.apache.spark.sql.catalyst.util.SparkDateTimeUtils.daysToMicros$(SparkDateTimeUtils.scala:172) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.daysToMicros(DateTimeUtils.scala:41) at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToTimestamp$14(Cast.scala:642) at scala.runtime.java8.JFunction1$mcJI$sp.apply(JFunction1$mcJI$sp.scala:17) at org.apache.spark.sql.catalyst.expressions.Cast.buildCast(Cast.scala:557) at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToTimestamp$13(Cast.scala:642) at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:1170) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:558) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
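[Editorial note] The stack trace above points at overflow-checked arithmetic in daysToMicros/instantToMicros: converting epoch days to microseconds multiplies by 86,400 and then by 1,000,000, and for year 5881580 the product exceeds Long.MAX_VALUE. A stand-alone JDK sketch of the failing arithmetic (simplified; the constants and method below mirror, but are not, Spark's actual utilities):

```java
import java.time.LocalDate;

// Stand-alone reproduction of the long overflow behind SPARK-47571.
// Simplified mirror of what daysToMicros does: epoch days -> seconds ->
// microseconds, using overflow-checked multiplication.
public class DateFormatOverflow {
    static final long SECONDS_PER_DAY = 86_400L;
    static final long MICROS_PER_SECOND = 1_000_000L;

    static long daysToMicros(long epochDays) {
        long seconds = Math.multiplyExact(epochDays, SECONDS_PER_DAY);
        // ~2.1e9 days * 86400 * 1e6 ~ 1.9e20, far beyond Long.MAX_VALUE (~9.2e18)
        return Math.multiplyExact(seconds, MICROS_PER_SECOND);
    }

    public static void main(String[] args) {
        long days = LocalDate.of(5881580, 1, 1).toEpochDay(); // roughly 2.1 billion days
        try {
            daysToMicros(days);
            System.out.println("no overflow");
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage()); // "long overflow", as in the Spark error
        }
    }
}
```

CAST to STRING never leaves the day representation, which is why it succeeds while date_format(), which goes through a microsecond timestamp, fails.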
[jira] [Updated] (SPARK-47570) Integrate range scan encoder changes with timer implementation
[ https://issues.apache.org/jira/browse/SPARK-47570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47570: --- Labels: pull-request-available (was: ) > Integrate range scan encoder changes with timer implementation > -- > > Key: SPARK-47570 > URL: https://issues.apache.org/jira/browse/SPARK-47570 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Jing Zhan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47570) Integrate range scan encoder changes with timer implementation
Jing Zhan created SPARK-47570: - Summary: Integrate range scan encoder changes with timer implementation Key: SPARK-47570 URL: https://issues.apache.org/jira/browse/SPARK-47570 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Jing Zhan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47569) Disallow comparing variant.
[ https://issues.apache.org/jira/browse/SPARK-47569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47569: --- Labels: pull-request-available (was: ) > Disallow comparing variant. > --- > > Key: SPARK-47569 > URL: https://issues.apache.org/jira/browse/SPARK-47569 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Chenhao Li >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47569) Disallow comparing variant.
Chenhao Li created SPARK-47569: -- Summary: Disallow comparing variant. Key: SPARK-47569 URL: https://issues.apache.org/jira/browse/SPARK-47569 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Chenhao Li -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47555) Show a warning message about SQLException if `JDBCTableCatalog.loadTable` fails
[ https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47555: -- Summary: Show a warning message about SQLException if `JDBCTableCatalog.loadTable` fails (was: Record necessary raw exception log when loadTable) > Show a warning message about SQLException if `JDBCTableCatalog.loadTable` > fails > --- > > Key: SPARK-47555 > URL: https://issues.apache.org/jira/browse/SPARK-47555 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: xleoken >Assignee: xleoken >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47566) SubstringIndex
[ https://issues.apache.org/jira/browse/SPARK-47566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47566: --- Labels: pull-request-available (was: ) > SubstringIndex > -- > > Key: SPARK-47566 > URL: https://issues.apache.org/jira/browse/SPARK-47566 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Milan Dankovic >Priority: Major > Labels: pull-request-available > > Enable collation support for the *SubstringIndex* built-in string function in > Spark. First confirm what is the expected behaviour for these functions when > given collated strings, and then move on to implementation and testing. One > way to go about this is to consider using {_}StringSearch{_}, an efficient > ICU service for string matching. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *SubstringIndex* functions > so that they support all collation types currently supported in Spark. To > understand what changes were introduced in order to enable full collation > support for other existing functions in Spark, take a look at the Spark PRs > and Jira tickets for completed tasks in this parent (for example: Contains, > StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
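[Editorial note] For a feel of what collation-aware matching means for functions like *SubstringIndex*, here is a small sketch. The ticket suggests ICU's _StringSearch_; to stay dependency-free this sketch uses the JDK's java.text.Collator instead, and its naive window scan assumes a match has the same character length as the pattern, a simplification that real ICU _StringSearch_ does not make:

```java
import java.text.Collator;
import java.util.Locale;

// Sketch of collation-aware substring matching using only the JDK.
// A Collator at PRIMARY strength treats case (and accent) differences
// as equal, analogous to a case-insensitive collation in Spark.
public class CollationSearch {
    // Naive scan: compare each same-length window under the collator.
    // Simplification: assumes matches have the pattern's character length.
    static int indexOf(String text, String pattern, Collator collator) {
        for (int i = 0; i + pattern.length() <= text.length(); i++) {
            String window = text.substring(i, i + pattern.length());
            if (collator.compare(window, pattern) == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.ROOT);
        c.setStrength(Collator.PRIMARY); // ignore case differences
        System.out.println(indexOf("www.Apache.ORG", "apache", c)); // 4
    }
}
```

ICU's StringSearch handles the cases this sketch cannot, such as matches whose collation elements span a different number of characters than the pattern.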
[jira] [Assigned] (SPARK-47555) Record necessary raw exception log when loadTable
[ https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47555: - Assignee: xleoken > Record necessary raw exception log when loadTable > - > > Key: SPARK-47555 > URL: https://issues.apache.org/jira/browse/SPARK-47555 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: xleoken >Assignee: xleoken >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47568) Fix race condition between maintenance thread and task thread for RocksDB snapshot
[ https://issues.apache.org/jira/browse/SPARK-47568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47568: --- Labels: pull-request-available (was: ) > Fix race condition between maintenance thread and task thread for RocksDB > snapshot > - > > Key: SPARK-47568 > URL: https://issues.apache.org/jira/browse/SPARK-47568 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2 >Reporter: Bhuwan Sahni >Priority: Major > Labels: pull-request-available > > There are currently some race conditions between maintenance thread and task > thread which can result in corrupted checkpoint state. > # The maintenance thread currently relies on class variable {{lastSnapshot}} > to find the latest checkpoint and uploads it to DFS. This checkpoint can be > modified at commit time by the task thread if a new snapshot is created. > # The task thread does not reset lastSnapshot at load time, which can result > in newer snapshots (if an old version is loaded) being considered valid and > uploaded to DFS. This results in VersionIdMismatch errors. > This issue proposes to fix these issues by guarding latestSnapshot variable > modification, and setting latestSnapshot properly at load time. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47555) Record necessary raw exception log when loadTable
[ https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47555: -- Parent: SPARK-47361 Issue Type: Sub-task (was: Improvement) > Record necessary raw exception log when loadTable > - > > Key: SPARK-47555 > URL: https://issues.apache.org/jira/browse/SPARK-47555 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: xleoken >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47555) Record necessary raw exception log when loadTable
[ https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47555. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45711 [https://github.com/apache/spark/pull/45711] > Record necessary raw exception log when loadTable > - > > Key: SPARK-47555 > URL: https://issues.apache.org/jira/browse/SPARK-47555 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: xleoken >Assignee: xleoken >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47568) Fix race condition between maintenance thread and task thead for RocksDB snapshot
Bhuwan Sahni created SPARK-47568: Summary: Fix race condition between maintenance thread and task thread for RocksDB snapshot Key: SPARK-47568 URL: https://issues.apache.org/jira/browse/SPARK-47568 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.5.1, 3.5.0, 4.0.0, 3.5.2 Reporter: Bhuwan Sahni There are currently some race conditions between maintenance thread and task thread which can result in corrupted checkpoint state. # The maintenance thread currently relies on class variable {{lastSnapshot}} to find the latest checkpoint and uploads it to DFS. This checkpoint can be modified at commit time by the task thread if a new snapshot is created. # The task thread does not reset lastSnapshot at load time, which can result in newer snapshots (if an old version is loaded) being considered valid and uploaded to DFS. This results in VersionIdMismatch errors. This issue proposes to fix these issues by guarding latestSnapshot variable modification, and setting latestSnapshot properly at load time. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
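[Editorial note] The fix direction described in this ticket (guard mutation of the snapshot reference, reset it on load) can be sketched schematically. The SnapshotHolder class below is hypothetical and greatly simplified relative to Spark's actual RocksDB state store; it only illustrates the synchronization pattern:

```java
// Schematic sketch of the proposed fix: synchronize access to the latest
// snapshot, and invalidate it when an older version is loaded so the
// maintenance thread can never upload a snapshot newer than the loaded state.
// Class, method, and field names are hypothetical, not Spark's actual code.
public class SnapshotHolder {
    // Version of the snapshot most recently produced by commit(), or -1 if none.
    private long lastSnapshotVersion = -1L;

    // Task thread: loading a version invalidates any snapshot from a later commit.
    public synchronized void load(long version) {
        if (lastSnapshotVersion > version) {
            lastSnapshotVersion = -1L; // reset, so a stale "future" snapshot is never uploaded
        }
    }

    // Task thread: a commit produces a new snapshot.
    public synchronized void commit(long version) {
        lastSnapshotVersion = version;
    }

    // Maintenance thread: take the snapshot to upload, atomically w.r.t. load/commit.
    public synchronized long pollSnapshotForUpload() {
        long v = lastSnapshotVersion;
        lastSnapshotVersion = -1L;
        return v;
    }
}
```

Without the reset in load(), a maintenance-thread upload after loading an older version would ship a snapshot ahead of the loaded state, which is the VersionIdMismatch scenario described above.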
[jira] [Updated] (SPARK-47555) Record necessary raw exception log when loadTable
[ https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47555: -- Affects Version/s: 4.0.0 (was: 3.5.1) > Record necessary raw exception log when loadTable > - > > Key: SPARK-47555 > URL: https://issues.apache.org/jira/browse/SPARK-47555 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: xleoken >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47558) [Arbitrary State Support] State TTL support - ValueState
[ https://issues.apache.org/jira/browse/SPARK-47558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830977#comment-17830977 ] Bhuwan Sahni commented on SPARK-47558: -- https://github.com/apache/spark/pull/45674 > [Arbitrary State Support] State TTL support - ValueState > > > Key: SPARK-47558 > URL: https://issues.apache.org/jira/browse/SPARK-47558 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Bhuwan Sahni >Priority: Major > > Add support for expiring state value based on ttl for Value State in > transformWithState operator. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47561) fix analyzer rule order issues about Alias
[ https://issues.apache.org/jira/browse/SPARK-47561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47561: - Assignee: Wenchen Fan > fix analyzer rule order issues about Alias > -- > > Key: SPARK-47561 > URL: https://issues.apache.org/jira/browse/SPARK-47561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47561) fix analyzer rule order issues about Alias
[ https://issues.apache.org/jira/browse/SPARK-47561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47561. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45718 [https://github.com/apache/spark/pull/45718] > fix analyzer rule order issues about Alias > -- > > Key: SPARK-47561 > URL: https://issues.apache.org/jira/browse/SPARK-47561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47544) [Pyspark] SparkSession builder method is incompatible with vs code intellisense
[ https://issues.apache.org/jira/browse/SPARK-47544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47544. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45700 [https://github.com/apache/spark/pull/45700] > [Pyspark] SparkSession builder method is incompatible with vs code > intellisense > --- > > Key: SPARK-47544 > URL: https://issues.apache.org/jira/browse/SPARK-47544 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: old.mov > > > VS code's intellisense is unable to recognize the methods under > `SparkSession.builder`. > > See attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47544) [Pyspark] SparkSession builder method is incompatible with vs code intellisense
[ https://issues.apache.org/jira/browse/SPARK-47544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47544: - Assignee: Niranjan Jayakar > [Pyspark] SparkSession builder method is incompatible with vs code > intellisense > --- > > Key: SPARK-47544 > URL: https://issues.apache.org/jira/browse/SPARK-47544 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > Labels: pull-request-available > Attachments: old.mov > > > VS code's intellisense is unable to recognize the methods under > `SparkSession.builder`. > > See attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47557) Audit MySQL ENUM/SET Types
[ https://issues.apache.org/jira/browse/SPARK-47557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47557: - Assignee: Kent Yao > Audit MySQL ENUM/SET Types > -- > > Key: SPARK-47557 > URL: https://issues.apache.org/jira/browse/SPARK-47557 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47557) Audit MySQL ENUM/SET Types
[ https://issues.apache.org/jira/browse/SPARK-47557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47557. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45713 [https://github.com/apache/spark/pull/45713] > Audit MySQL ENUM/SET Types > -- > > Key: SPARK-47557 > URL: https://issues.apache.org/jira/browse/SPARK-47557 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47413) Substring, Right, Left (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830952#comment-17830952 ] Gideon P commented on SPARK-47413: -- [~davidm-db] Awesome, thanks! Do you have any guidance BTW as to when I should try to get this completed by? > Substring, Right, Left (all collations) > --- > > Key: SPARK-47413 > URL: https://issues.apache.org/jira/browse/SPARK-47413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *Substring* built-in string function in > Spark (including *Right* and *Left* functions). First confirm what is the > expected behaviour for these functions when given collated strings, then move > on to the implementation that would enable handling strings of all collation > types. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the {*}Substring{*}, > {*}Right{*}, and *Left* functions so that they support all collation types > currently supported in Spark. To understand what changes were introduced in > order to enable full collation support for other existing functions in Spark, > take a look at the Spark PRs and Jira tickets for completed tasks in this > parent (for example: Contains, StartsWith, EndsWith). > > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. 
Also, refer to the Unicode Technical > Standard for > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-47477) SubstringIndex, StringLocate (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829408#comment-17829408 ] Milan Dankovic edited comment on SPARK-47477 at 3/26/24 1:54 PM: - I am working on SubstringIndex sub-task was (Author: JIRAUSER304529): I am working on this > SubstringIndex, StringLocate (all collations) > - > > Key: SPARK-47477 > URL: https://issues.apache.org/jira/browse/SPARK-47477 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *SubstringIndex* and *StringLocate* built-in > string functions in Spark. First confirm what is the expected behaviour for > these functions when given collated strings, and then move on to > implementation and testing. One way to go about this is to consider using > {_}StringSearch{_}, an efficient ICU service for string matching. Implement > the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests > (CollationSuite) to reflect how this function should be used with collation > in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment > with the existing functions to learn more about how they work. In addition, > look into the possible use-cases and implementation of similar functions > within other open-source DBMS, such as > [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *SubstringIndex* and > *StringLocate* functions so that they support all collation types currently > supported in Spark. To understand what changes were introduced in order to > enable full collation support for other existing functions in Spark, take a > look at the Spark PRs and Jira tickets for completed tasks in this parent > (for example: Contains, StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47567) StringLocate
[ https://issues.apache.org/jira/browse/SPARK-47567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Milan Dankovic updated SPARK-47567: --- Description: Enable collation support for the *StringLocate* built-in string function in Spark. First confirm what is the expected behaviour for these functions when given collated strings, and then move on to implementation and testing. One way to go about this is to consider using {_}StringSearch{_}, an efficient ICU service for string matching. Implement the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect how this function should be used with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment with the existing functions to learn more about how they work. In addition, look into the possible use-cases and implementation of similar functions within other open-source DBMS, such as [PostgreSQL|https://www.postgresql.org/docs/]. The goal for this Jira ticket is to implement the *StringLocate* functions so that they support all collation types currently supported in Spark. To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the Spark PRs and Jira tickets for completed tasks in this parent (for example: Contains, StartsWith, EndsWith). Read more about ICU [Collation Concepts|http://example.com/] and [Collator|http://example.com/] class, as well as _StringSearch_ using the [ICU user guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] and [ICU docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. Also, refer to the Unicode Technical Standard for string [searching|https://www.unicode.org/reports/tr10/#Searching] and [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. 
> StringLocate > > > Key: SPARK-47567 > URL: https://issues.apache.org/jira/browse/SPARK-47567 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Milan Dankovic >Priority: Major > > Enable collation support for the *StringLocate* built-in string function in > Spark. First confirm what is the expected behaviour for these functions when > given collated strings, and then move on to implementation and testing. One > way to go about this is to consider using {_}StringSearch{_}, an efficient > ICU service for string matching. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringLocate* functions so > that they support all collation types currently supported in Spark. To > understand what changes were introduced in order to enable full collation > support for other existing functions in Spark, take a look at the Spark PRs > and Jira tickets for completed tasks in this parent (for example: Contains, > StartsWith, EndsWith). > > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. 
> Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
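The StringLocate ticket above only describes intended semantics; real collation support would go through ICU's _StringSearch_. As a rough, hypothetical sketch of what collation-aware locate means, a simplified case-insensitive collation can be modeled with Python's `casefold()` (the collation names here are illustrative, not Spark's actual implementation):

```python
# Toy sketch of collation-aware StringLocate semantics (1-based index, with 0
# meaning "not found", as in Spark's locate()). NOT Spark's implementation:
# the case-insensitive collation is approximated with str.casefold().

def string_locate(substr: str, s: str, collation: str = "UTF8_BINARY") -> int:
    """Return the 1-based position of substr in s, or 0 if absent."""
    if collation == "UTF8_BINARY":          # plain binary comparison
        idx = s.find(substr)
    elif collation == "UTF8_BINARY_LCASE":  # simplified case-insensitive match
        idx = s.casefold().find(substr.casefold())
    else:
        raise ValueError(f"unsupported collation: {collation}")
    return idx + 1

print(string_locate("SQL", "spark sql"))                       # 0 (binary: no match)
print(string_locate("SQL", "spark sql", "UTF8_BINARY_LCASE"))  # 7
```

Note that `casefold()` can change string length for some characters (e.g. German ß), so this index arithmetic is only approximate; ICU's StringSearch matches against the original string and avoids that problem.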
[jira] [Commented] (SPARK-47566) SubstringIndex
[ https://issues.apache.org/jira/browse/SPARK-47566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830938#comment-17830938 ] Milan Dankovic commented on SPARK-47566: I am working on this > SubstringIndex > -- > > Key: SPARK-47566 > URL: https://issues.apache.org/jira/browse/SPARK-47566 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Milan Dankovic >Priority: Major > > Enable collation support for the *SubstringIndex* built-in string function in > Spark. First confirm what is the expected behaviour for these functions when > given collated strings, and then move on to implementation and testing. One > way to go about this is to consider using {_}StringSearch{_}, an efficient > ICU service for string matching. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *SubstringIndex* functions > so that they support all collation types currently supported in Spark. To > understand what changes were introduced in order to enable full collation > support for other existing functions in Spark, take a look at the Spark PRs > and Jira tickets for completed tasks in this parent (for example: Contains, > StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
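The SubstringIndex ticket above can be illustrated the same way. The sketch below mimics the behaviour of Spark's `substring_index(str, delim, count)` (return the part of the string before `count` occurrences of the delimiter, counting from the right for negative counts); the `binary` flag is a hypothetical stand-in for a real collation parameter, again approximated with `casefold()`:

```python
# Toy sketch of SubstringIndex semantics. Not Spark's code; collation-aware
# matching would use ICU StringSearch rather than casefold().

def substring_index(s: str, delim: str, count: int, binary: bool = True) -> str:
    hay = s if binary else s.casefold()
    needle = delim if binary else delim.casefold()
    if count == 0 or not needle:
        return ""
    # collect delimiter match positions, left to right
    positions, start = [], 0
    while (i := hay.find(needle, start)) != -1:
        positions.append(i)
        start = i + len(needle)
    if count > 0:
        # everything before the count-th occurrence (whole string if too few)
        return s if len(positions) < count else s[:positions[count - 1]]
    # negative count: everything after the |count|-th occurrence from the right
    if len(positions) < -count:
        return s
    return s[positions[len(positions) + count] + len(needle):]

print(substring_index("www.apache.org", ".", 2))    # www.apache
print(substring_index("www.apache.org", ".", -1))   # org
```

As with locate, `casefold()` may change string lengths for some characters, so the position bookkeeping is only a rough model of what ICU-based matching would do.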
[jira] [Created] (SPARK-47567) StringLocate
Milan Dankovic created SPARK-47567: -- Summary: StringLocate Key: SPARK-47567 URL: https://issues.apache.org/jira/browse/SPARK-47567 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Milan Dankovic
[jira] [Updated] (SPARK-47477) SubstringIndex, StringLocate (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Milan Dankovic updated SPARK-47477: --- Description: Enable collation support for the *SubstringIndex* and *StringLocate* built-in string functions in Spark. First confirm what is the expected behaviour for these functions when given collated strings, and then move on to implementation and testing. One way to go about this is to consider using {_}StringSearch{_}, an efficient ICU service for string matching. Implement the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect how this function should be used with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment with the existing functions to learn more about how they work. In addition, look into the possible use-cases and implementation of similar functions within other open-source DBMS, such as [PostgreSQL|https://www.postgresql.org/docs/]. The goal for this Jira ticket is to implement the *SubstringIndex* and *StringLocate* functions so that they support all collation types currently supported in Spark. To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the Spark PRs and Jira tickets for completed tasks in this parent (for example: Contains, StartsWith, EndsWith). Read more about ICU [Collation Concepts|http://example.com/] and [Collator|http://example.com/] class, as well as _StringSearch_ using the [ICU user guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] and [ICU docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. Also, refer to the Unicode Technical Standard for string [searching|https://www.unicode.org/reports/tr10/#Searching] and [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. 
was: Enable collation support for the *StringInstr* and *FindInSet* built-in string functions in Spark. First confirm what is the expected behaviour for these functions when given collated strings, and then move on to implementation and testing. One way to go about this is to consider using {_}StringSearch{_}, an efficient ICU service for string matching. Implement the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect how this function should be used with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment with the existing functions to learn more about how they work. In addition, look into the possible use-cases and implementation of similar functions within other open-source DBMS, such as [PostgreSQL|https://www.postgresql.org/docs/]. The goal for this Jira ticket is to implement the *StringInstr* and *FindInSet* functions so that they support all collation types currently supported in Spark. To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the Spark PRs and Jira tickets for completed tasks in this parent (for example: Contains, StartsWith, EndsWith). Read more about ICU [Collation Concepts|http://example.com/] and [Collator|http://example.com/] class, as well as _StringSearch_ using the [ICU user guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] and [ICU docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. Also, refer to the Unicode Technical Standard for string [searching|https://www.unicode.org/reports/tr10/#Searching] and [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. 
> SubstringIndex, StringLocate (all collations) > - > > Key: SPARK-47477 > URL: https://issues.apache.org/jira/browse/SPARK-47477 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *SubstringIndex* and *StringLocate* built-in > string functions in Spark. First confirm what is the expected behaviour for > these functions when given collated strings, and then move on to > implementation and testing. One way to go about this is to consider using > {_}StringSearch{_}, an efficient ICU service for string matching. Implement > the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests > (CollationSuite) to reflect how this function should be used with collation > in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment > with the existing functions to learn more about how they work. In addit
[jira] [Updated] (SPARK-28419) A patch for SparkThriftServer support multi-tenant authentication
[ https://issues.apache.org/jira/browse/SPARK-28419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-28419: --- Labels: pull-request-available (was: ) > A patch for SparkThriftServer support multi-tenant authentication > - > > Key: SPARK-28419 > URL: https://issues.apache.org/jira/browse/SPARK-28419 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Minor > Labels: pull-request-available >
[jira] [Updated] (SPARK-47566) SubstringIndex
[ https://issues.apache.org/jira/browse/SPARK-47566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Milan Dankovic updated SPARK-47566: --- Description: Enable collation support for the *SubstringIndex* built-in string function in Spark. First confirm what is the expected behaviour for these functions when given collated strings, and then move on to implementation and testing. One way to go about this is to consider using {_}StringSearch{_}, an efficient ICU service for string matching. Implement the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect how this function should be used with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment with the existing functions to learn more about how they work. In addition, look into the possible use-cases and implementation of similar functions within other open-source DBMS, such as [PostgreSQL|https://www.postgresql.org/docs/]. The goal for this Jira ticket is to implement the *SubstringIndex* functions so that they support all collation types currently supported in Spark. To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the Spark PRs and Jira tickets for completed tasks in this parent (for example: Contains, StartsWith, EndsWith). Read more about ICU [Collation Concepts|http://example.com/] and [Collator|http://example.com/] class, as well as _StringSearch_ using the [ICU user guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] and [ICU docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. Also, refer to the Unicode Technical Standard for string [searching|https://www.unicode.org/reports/tr10/#Searching] and [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. 
> SubstringIndex > -- > > Key: SPARK-47566 > URL: https://issues.apache.org/jira/browse/SPARK-47566 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Milan Dankovic >Priority: Major > > Enable collation support for the *SubstringIndex* built-in string function in > Spark. First confirm what is the expected behaviour for these functions when > given collated strings, and then move on to implementation and testing. One > way to go about this is to consider using {_}StringSearch{_}, an efficient > ICU service for string matching. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *SubstringIndex* functions > so that they support all collation types currently supported in Spark. To > understand what changes were introduced in order to enable full collation > support for other existing functions in Spark, take a look at the Spark PRs > and Jira tickets for completed tasks in this parent (for example: Contains, > StartsWith, EndsWith). > > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. 
> Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
[jira] [Created] (SPARK-47566) SubstringIndex
Milan Dankovic created SPARK-47566: -- Summary: SubstringIndex Key: SPARK-47566 URL: https://issues.apache.org/jira/browse/SPARK-47566 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Milan Dankovic
[jira] [Updated] (SPARK-47477) SubstringIndex, StringLocate (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-47477: -- Parent: (was: SPARK-46837) Issue Type: New Feature (was: Sub-task) > SubstringIndex, StringLocate (all collations) > - > > Key: SPARK-47477 > URL: https://issues.apache.org/jira/browse/SPARK-47477 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringInstr* and *FindInSet* built-in > string functions in Spark. First confirm what is the expected behaviour for > these functions when given collated strings, and then move on to > implementation and testing. One way to go about this is to consider using > {_}StringSearch{_}, an efficient ICU service for string matching. Implement > the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests > (CollationSuite) to reflect how this function should be used with collation > in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment > with the existing functions to learn more about how they work. In addition, > look into the possible use-cases and implementation of similar functions > within other open-source DBMS, such as > [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringInstr* and > *FindInSet* functions so that they support all collation types currently > supported in Spark. To understand what changes were introduced in order to > enable full collation support for other existing functions in Spark, take a > look at the Spark PRs and Jira tickets for completed tasks in this parent > (for example: Contains, StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
[jira] [Updated] (SPARK-47477) SubstringIndex, StringLocate (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-47477: -- Epic Link: SPARK-46830 > SubstringIndex, StringLocate (all collations) > - > > Key: SPARK-47477 > URL: https://issues.apache.org/jira/browse/SPARK-47477 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringInstr* and *FindInSet* built-in > string functions in Spark. First confirm what is the expected behaviour for > these functions when given collated strings, and then move on to > implementation and testing. One way to go about this is to consider using > {_}StringSearch{_}, an efficient ICU service for string matching. Implement > the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests > (CollationSuite) to reflect how this function should be used with collation > in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment > with the existing functions to learn more about how they work. In addition, > look into the possible use-cases and implementation of similar functions > within other open-source DBMS, such as > [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringInstr* and > *FindInSet* functions so that they support all collation types currently > supported in Spark. To understand what changes were introduced in order to > enable full collation support for other existing functions in Spark, take a > look at the Spark PRs and Jira tickets for completed tasks in this parent > (for example: Contains, StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
[jira] [Updated] (SPARK-47477) SubstringIndex, StringLocate (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-47477: -- Labels: (was: pull-request-available) > SubstringIndex, StringLocate (all collations) > - > > Key: SPARK-47477 > URL: https://issues.apache.org/jira/browse/SPARK-47477 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringInstr* and *FindInSet* built-in > string functions in Spark. First confirm what is the expected behaviour for > these functions when given collated strings, and then move on to > implementation and testing. One way to go about this is to consider using > {_}StringSearch{_}, an efficient ICU service for string matching. Implement > the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests > (CollationSuite) to reflect how this function should be used with collation > in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment > with the existing functions to learn more about how they work. In addition, > look into the possible use-cases and implementation of similar functions > within other open-source DBMS, such as > [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringInstr* and > *FindInSet* functions so that they support all collation types currently > supported in Spark. To understand what changes were introduced in order to > enable full collation support for other existing functions in Spark, take a > look at the Spark PRs and Jira tickets for completed tasks in this parent > (for example: Contains, StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
[jira] [Updated] (SPARK-47477) SubstringIndex, StringLocate (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Milan Dankovic updated SPARK-47477: --- Labels: (was: pull-request-available) > SubstringIndex, StringLocate (all collations) > - > > Key: SPARK-47477 > URL: https://issues.apache.org/jira/browse/SPARK-47477 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringInstr* and *FindInSet* built-in > string functions in Spark. First confirm what is the expected behaviour for > these functions when given collated strings, and then move on to > implementation and testing. One way to go about this is to consider using > {_}StringSearch{_}, an efficient ICU service for string matching. Implement > the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests > (CollationSuite) to reflect how this function should be used with collation > in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment > with the existing functions to learn more about how they work. In addition, > look into the possible use-cases and implementation of similar functions > within other open-source DBMS, such as > [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringInstr* and > *FindInSet* functions so that they support all collation types currently > supported in Spark. To understand what changes were introduced in order to > enable full collation support for other existing functions in Spark, take a > look at the Spark PRs and Jira tickets for completed tasks in this parent > (for example: Contains, StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
[jira] [Resolved] (SPARK-47431) Add session level default Collation
[ https://issues.apache.org/jira/browse/SPARK-47431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47431. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45592 [https://github.com/apache/spark/pull/45592] > Add session level default Collation > --- > > Key: SPARK-47431 > URL: https://issues.apache.org/jira/browse/SPARK-47431 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Assignee: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > For now, the default session-level collation is UTF8_BINARY. In the future > we want to make this configurable through an explicit session-level setting.
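The resolution order implied by the session-default-collation ticket above (an explicit collation on the expression wins, else the session default, else UTF8_BINARY) can be sketched as follows. The function and parameter names are hypothetical, not Spark API:

```python
# Hypothetical sketch of collation resolution precedence; illustrative only.
from typing import Optional

DEFAULT_COLLATION = "UTF8_BINARY"

def resolve_collation(explicit: Optional[str], session_default: Optional[str]) -> str:
    if explicit:            # a COLLATE clause on the column/expression wins
        return explicit
    if session_default:     # otherwise fall back to the session-level default
        return session_default
    return DEFAULT_COLLATION

print(resolve_collation(None, None))          # UTF8_BINARY
print(resolve_collation(None, "UNICODE_CI"))  # UNICODE_CI
```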
[jira] [Updated] (SPARK-47565) PySpark workers dying in daemon mode idle queue fail query
[ https://issues.apache.org/jira/browse/SPARK-47565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47565: --- Labels: pull-request-available (was: ) > PySpark workers dying in daemon mode idle queue fail query > -- > > Key: SPARK-47565 > URL: https://issues.apache.org/jira/browse/SPARK-47565 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.2, 3.5.1, 3.3.4 >Reporter: Sebastian Hillig >Priority: Major > Labels: pull-request-available > > PySpark workers may die after entering the idle queue in > `PythonWorkerFactory`. This may happen because of code that runs in the > process, or external factors. > When drawn from the warm pool, such a worker will result in an I/O exception > on the first read/write.
[jira] [Commented] (SPARK-47565) PySpark workers dying in daemon mode idle queue fail query
[ https://issues.apache.org/jira/browse/SPARK-47565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830913#comment-17830913 ] Nikita Awasthi commented on SPARK-47565: User 'sebastianhillig-db' has created a pull request for this issue: https://github.com/apache/spark/pull/45635 > PySpark workers dying in daemon mode idle queue fail query > -- > > Key: SPARK-47565 > URL: https://issues.apache.org/jira/browse/SPARK-47565 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.2, 3.5.1, 3.3.4 >Reporter: Sebastian Hillig >Priority: Major > Labels: pull-request-available > > PySpark workers may die after entering the idle queue in > `PythonWorkerFactory`. This may happen because of code that runs in the > process, or external factors. > When drawn from the warm pool, such a worker will result in an I/O exception > on the first read/write.
[jira] [Created] (SPARK-47565) PySpark workers dying in daemon mode idle queue fail query
Sebastian Hillig created SPARK-47565: Summary: PySpark workers dying in daemon mode idle queue fail query Key: SPARK-47565 URL: https://issues.apache.org/jira/browse/SPARK-47565 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.4, 3.5.1, 3.4.2 Reporter: Sebastian Hillig PySpark workers may die after entering the idle queue in `PythonWorkerFactory`. This may happen because of code that runs in the process, or external factors. When drawn from the warm pool, such a worker will result in an I/O exception on the first read/write.
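The failure mode described in SPARK-47565, and one possible mitigation (checking liveness before reusing a pooled worker instead of letting the first read/write fail), can be sketched roughly as below. This is not `PythonWorkerFactory`'s actual code; a plain subprocess stands in for a PySpark daemon worker, and `WorkerPool` is a hypothetical name:

```python
# Hedged sketch: discard workers that died while sitting in the idle queue,
# rather than handing them out and failing on the first I/O.
import subprocess
import sys
from collections import deque

class WorkerPool:
    def __init__(self):
        self.idle = deque()

    def _spawn(self):
        # stand-in worker: a Python process that blocks on a line from stdin
        return subprocess.Popen(
            [sys.executable, "-c", "import sys; sys.stdin.readline()"],
            stdin=subprocess.PIPE)

    def release(self, worker):
        self.idle.append(worker)

    def acquire(self):
        while self.idle:
            w = self.idle.popleft()
            if w.poll() is None:   # process still alive: safe to reuse
                return w
            # process died while idle: drop it and try the next one
        return self._spawn()

pool = WorkerPool()
w = pool._spawn()
pool.release(w)
reused = pool.acquire()   # the live worker comes straight back
w.stdin.close()           # let the stand-in worker exit cleanly
w.wait()
```

The design point is simply that liveness is verified at hand-out time (`poll()` here; the real fix could equally probe the worker socket), so a worker that died in the queue never reaches query execution.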
[jira] [Updated] (SPARK-47564) always throw FAILED_READ_FILE error when fail to read files
[ https://issues.apache.org/jira/browse/SPARK-47564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47564: --- Labels: pull-request-available (was: ) > always throw FAILED_READ_FILE error when fail to read files > --- > > Key: SPARK-47564 > URL: https://issues.apache.org/jira/browse/SPARK-47564 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Priority: Major > Labels: pull-request-available >