[jira] [Created] (SPARK-47621) Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean`
Hyukjin Kwon created SPARK-47621: Summary: Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean` Key: SPARK-47621 URL: https://issues.apache.org/jira/browse/SPARK-47621 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
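For context on what these docstrings describe: the `try_*` aggregate variants return NULL instead of raising when the input is invalid or the computation fails. A rough pure-Python model of that behavior (an illustration only, not PySpark's implementation; `try_sum` here is a hypothetical stand-in):

```python
def try_sum(values):
    """Illustrative model of SQL try_sum semantics: NULL-valued (None) inputs
    are ignored, a sum over no values is None (NULL), and any error during
    accumulation yields None instead of raising. Conceptual sketch only."""
    try:
        vals = [v for v in values if v is not None]
        if not vals:
            return None  # SQL aggregates over zero rows return NULL
        total = 0
        for v in vals:
            total = total + v
        return total
    except (TypeError, OverflowError):
        return None  # the "try" behavior: swallow the error, return NULL
```

The plain `sum`/`avg` counterparts would instead propagate the error; that contrast is the point the refined docstrings spell out.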
[jira] [Resolved] (SPARK-47363) Initial State without state reader implementation for State API v2.
[ https://issues.apache.org/jira/browse/SPARK-47363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-47363. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45467 [https://github.com/apache/spark/pull/45467] > Initial State without state reader implementation for State API v2. > --- > > Key: SPARK-47363 > URL: https://issues.apache.org/jira/browse/SPARK-47363 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Jing Zhan >Assignee: Jing Zhan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > This PR adds support for users to provide a DataFrame that can be used to > instantiate state for the query in the first batch for arbitrary state API v2. > Note that populating the initial state will only happen for the first batch > of the new streaming query. Trying to re-initialize state for the same > grouping key will result in an error.
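The two constraints described above (initial state applies only to the first batch, and re-initializing the same grouping key is an error) can be modeled with a small sketch. The names here are hypothetical, not the actual State API v2 surface:

```python
class InitialStateHandler:
    """Toy model of first-batch-only state initialization for a stateful
    streaming query. Hypothetical names; far simpler than Spark's State API v2."""

    def __init__(self):
        self._state = {}
        self._initialized_keys = set()
        self.first_batch_done = False

    def init_state(self, key, value):
        # Initial state may only be populated during the first batch...
        if self.first_batch_done:
            raise RuntimeError("initial state only applies to the first batch")
        # ...and only once per grouping key.
        if key in self._initialized_keys:
            raise ValueError(f"state for grouping key {key!r} already initialized")
        self._initialized_keys.add(key)
        self._state[key] = value

    def finish_first_batch(self):
        self.first_batch_done = True

    def get(self, key):
        return self._state.get(key)
```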
[jira] [Commented] (SPARK-47287) Aggregate in not causes
[ https://issues.apache.org/jira/browse/SPARK-47287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831625#comment-17831625 ] Juefei Yan commented on SPARK-47287: I tried the code on the 3.4 branch and cannot reproduce this problem. > Aggregate in not causes > > > Key: SPARK-47287 > URL: https://issues.apache.org/jira/browse/SPARK-47287 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Ted Chester Jenks >Priority: Major > > > The below snippet is confirmed working with Spark 3.2.1 and broken with Spark > 3.4.1. I believe this is a bug. > {code:java} >Dataset<Row> ds = dummyDataset > .withColumn("flag", > functions.not(functions.coalesce(functions.col("bool1"), > functions.lit(false)).equalTo(true))) > .groupBy("code") > .agg(functions.max(functions.col("flag")).alias("flag")); > ds.show(); {code} > It fails with: > {code:java} > Caused by: java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:208) > at > org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.$anonfun$generateExpression$7(V2ExpressionBuilder.scala:185) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:184) > at > org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33) > at > org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803) > at > org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateAggregateFunc(V2ExpressionBuilder.scala:293) > at > org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:98) > at > org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33) > at > org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803) > at > org.apache.spark.sql.execution.datasources.DataSourceStrategy$.translate$1(DataSourceStrategy.scala:700){code} > >
[jira] [Resolved] (SPARK-47619) Refine docstring of `to_json/from_json`
[ https://issues.apache.org/jira/browse/SPARK-47619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47619. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45742 [https://github.com/apache/spark/pull/45742] > Refine docstring of `to_json/from_json` > --- > > Key: SPARK-47619 > URL: https://issues.apache.org/jira/browse/SPARK-47619 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Commented] (SPARK-44050) Unable to Mount ConfigMap in Driver Pod - ConfigMap Creation Issue
[ https://issues.apache.org/jira/browse/SPARK-44050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831611#comment-17831611 ] liangyouze commented on SPARK-44050: I've encountered the same issue. I'd like to create a PR to try fixing this issue > Unable to Mount ConfigMap in Driver Pod - ConfigMap Creation Issue > -- > > Key: SPARK-44050 > URL: https://issues.apache.org/jira/browse/SPARK-44050 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Submit >Affects Versions: 3.3.1 >Reporter: Harshwardhan Singh Dodiya >Priority: Critical > Attachments: image-2023-06-14-11-07-36-960.png > > > Dear Spark community, > I am facing an issue related to mounting a ConfigMap in the driver pod of my > Spark application. Upon investigation, I realized that the problem is caused > by the ConfigMap not being created successfully. > *Problem Description:* > When attempting to mount the ConfigMap in the driver pod, I encounter > consistent failures and my pod stays in containerCreating state. Upon further > investigation, I discovered that the ConfigMap does not exist in the > Kubernetes cluster, which results in the driver pod's inability to access the > required configuration data. > *Additional Information:* > I would like to highlight that this issue is not a frequent occurrence. It > has been observed randomly, affecting the mounting of the ConfigMap in the > driver pod only approximately 5% of the time. This intermittent behavior adds > complexity to the troubleshooting process, as it is challenging to reproduce > consistently. > *Error Message:* > when describing driver pod (kubectl describe pod pod_name) get the below > error. > "ConfigMap '' not found." > *To Reproduce:* > 1. Download spark 3.3.1 from [https://spark.apache.org/downloads.html] > 2. create an image with "bin/docker-image-tool.sh" > 3. Submit on spark-client via bash command by passing all the details and > configurations. > 4. 
Randomly in some of the driver pods we can observe this issue. > >
[jira] [Updated] (SPARK-47619) Refine docstring of `to_json/from_json`
[ https://issues.apache.org/jira/browse/SPARK-47619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47619: --- Labels: pull-request-available (was: ) > Refine docstring of `to_json/from_json` > --- > > Key: SPARK-47619 > URL: https://issues.apache.org/jira/browse/SPARK-47619 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-47619) Refine docstring of `to_json/from_json`
Hyukjin Kwon created SPARK-47619: Summary: Refine docstring of `to_json/from_json` Key: SPARK-47619 URL: https://issues.apache.org/jira/browse/SPARK-47619 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon
[jira] [Resolved] (SPARK-47543) Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation.
[ https://issues.apache.org/jira/browse/SPARK-47543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47543. -- Assignee: Haejoon Lee Resolution: Fixed Fixed in https://github.com/apache/spark/pull/45699 > Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame > creation. > > > Key: SPARK-47543 > URL: https://issues.apache.org/jira/browse/SPARK-47543 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Currently PyArrow infers the Pandas dictionary field as StructType > instead of MapType, so Spark can't handle the schema properly: > {code:java} > >>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{'first': 0.7, > >>> 'second': 0.3}]}) > >>> pa.Schema.from_pandas(pdf) > str_col: string > dict_col: struct<first: double, second: double> > child 0, first: double > child 1, second: double > {code} > We cannot handle this case since we use PyArrow for schema creation.
[jira] [Updated] (SPARK-47543) Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation.
[ https://issues.apache.org/jira/browse/SPARK-47543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-47543: - Fix Version/s: 4.0.0 > Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame > creation. > > > Key: SPARK-47543 > URL: https://issues.apache.org/jira/browse/SPARK-47543 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently PyArrow infers the Pandas dictionary field as StructType > instead of MapType, so Spark can't handle the schema properly: > {code:java} > >>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{'first': 0.7, > >>> 'second': 0.3}]}) > >>> pa.Schema.from_pandas(pdf) > str_col: string > dict_col: struct<first: double, second: double> > child 0, first: double > child 1, second: double > {code} > We cannot handle this case since we use PyArrow for schema creation.
[jira] [Resolved] (SPARK-47107) Implement partition reader for python streaming data source
[ https://issues.apache.org/jira/browse/SPARK-47107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-47107. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45485 [https://github.com/apache/spark/pull/45485] > Implement partition reader for python streaming data source > --- > > Key: SPARK-47107 > URL: https://issues.apache.org/jira/browse/SPARK-47107 > Project: Spark > Issue Type: Improvement > Components: PySpark, SS >Affects Versions: 4.0.0 >Reporter: Chaoqin Li >Assignee: Chaoqin Li >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Piggy back the PythonPartitionReaderFactory to implement reading a data > partition for python streaming data source. Add test case to verify that > python streaming data source can read and process data end to end.
[jira] [Assigned] (SPARK-47107) Implement partition reader for python streaming data source
[ https://issues.apache.org/jira/browse/SPARK-47107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-47107: Assignee: Chaoqin Li > Implement partition reader for python streaming data source > --- > > Key: SPARK-47107 > URL: https://issues.apache.org/jira/browse/SPARK-47107 > Project: Spark > Issue Type: Improvement > Components: PySpark, SS >Affects Versions: 4.0.0 >Reporter: Chaoqin Li >Assignee: Chaoqin Li >Priority: Major > Labels: pull-request-available > > Piggy back the PythonPartitionReaderFactory to implement reading a data > partition for python streaming data source. Add test case to verify that > python streaming data source can read and process data end to end.
[jira] [Updated] (SPARK-47618) Use Magic Committer for all S3 buckets by default
[ https://issues.apache.org/jira/browse/SPARK-47618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47618: -- Description: This issue aims to use Apache Hadoop `Magic Committer` for all S3 buckets by default in Apache Spark 4.0.0. Apache Hadoop `Magic Committer` has been used for S3 buckets to get the best performance since [S3 becomes fully consistent on December 1st, 2020|https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/]. - https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel bq. Amazon S3 provides strong read-after-write consistency for PUT and DELETE requests of objects in your Amazon S3 bucket in all AWS Regions. This behavior applies to both writes to new objects as well as PUT requests that overwrite existing objects and DELETE requests. In addition, read operations on Amazon S3 Select, Amazon S3 access controls lists (ACLs), Amazon S3 Object Tags, and object metadata (for example, the HEAD object) are strongly consistent. > Use Magic Committer for all S3 buckets by default > - > > Key: SPARK-47618 > URL: https://issues.apache.org/jira/browse/SPARK-47618 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > This issue aims to use Apache Hadoop `Magic Committer` for all S3 buckets by > default in Apache Spark 4.0.0. > Apache Hadoop `Magic Committer` has been used for S3 buckets to get the best > performance since [S3 becomes fully consistent on December 1st, > 2020|https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/]. > - > https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel > bq. Amazon S3 provides strong read-after-write consistency for PUT and DELETE > requests of objects in your Amazon S3 bucket in all AWS Regions. 
This > behavior applies to both writes to new objects as well as PUT requests that > overwrite existing objects and DELETE requests. In addition, read operations > on Amazon S3 Select, Amazon S3 access controls lists (ACLs), Amazon S3 Object > Tags, and object metadata (for example, the HEAD object) are strongly > consistent.
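For reference, enabling the magic committer manually has required settings along these lines. This is a hedged sketch based on the Hadoop S3A committer and Spark cloud-integration documentation (SPARK-47618 is about making this the default rather than opt-in):

```properties
# spark-defaults.conf sketch: route S3A output through the magic committer.
# Exact keys should be checked against the Hadoop/Spark docs for your version.
spark.hadoop.fs.s3a.committer.magic.enabled  true
spark.hadoop.fs.s3a.committer.name           magic
spark.sql.sources.commitProtocolClass        org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class     org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

The magic committer avoids the rename-based commit that directory committers use, which is why it performs well now that S3 is strongly consistent.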
[jira] [Assigned] (SPARK-47618) Use Magic Committer for all S3 buckets by default
[ https://issues.apache.org/jira/browse/SPARK-47618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47618: - Assignee: Dongjoon Hyun > Use Magic Committer for all S3 buckets by default > - > > Key: SPARK-47618 > URL: https://issues.apache.org/jira/browse/SPARK-47618 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-47618) Use Magic Committer for all S3 buckets by default
Dongjoon Hyun created SPARK-47618: - Summary: Use Magic Committer for all S3 buckets by default Key: SPARK-47618 URL: https://issues.apache.org/jira/browse/SPARK-47618 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-47618) Use Magic Committer for all S3 buckets by default
[ https://issues.apache.org/jira/browse/SPARK-47618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47618: --- Labels: pull-request-available (was: ) > Use Magic Committer for all S3 buckets by default > - > > Key: SPARK-47618 > URL: https://issues.apache.org/jira/browse/SPARK-47618 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-47609) CacheManager Lookup can miss picking InMemoryRelation corresponding to subplan
[ https://issues.apache.org/jira/browse/SPARK-47609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47609: - Description: This issue became apparent while bringing my PR [https://github.com/apache/spark/pull/43854] in sync with the latest master. That PR is meant to collapse projects early, in the analyzer phase itself, so that the tree size is kept to a minimum as projects keep getting added. But as part of that work, the CacheManager lookup also needed to be modified, and one of the newly added tests in master failed. Analysis of the failure shows that the CacheManager is not picking the cached InMemoryRelation corresponding to a subplan. This shows up in the following existing test in org.apache.spark.sql.DatasetCacheSuite:

{code:scala}
test("SPARK-26708 Cache data and cached plan should stay consistent") {
  val df = spark.range(0, 5).toDF("a")
  val df1 = df.withColumn("b", $"a" + 1)
  val df2 = df.filter($"a" > 1)

  df.cache()
  // Add df1 to the CacheManager; the buffer is currently empty.
  df1.cache()
  // After calling collect(), df1's buffer has been loaded.
  df1.collect()
  // Add df2 to the CacheManager; the buffer is currently empty.
  df2.cache()

  // Verify that df1 is an InMemoryRelation plan with a dependency on another cached plan.
  assertCacheDependency(df1)
  val df1InnerPlan = df1.queryExecution.withCachedData
    .asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
  // Verify that df2 is an InMemoryRelation plan with a dependency on another cached plan.
  assertCacheDependency(df2)

  df.unpersist(blocking = true)

  // Verify that df1's cache has stayed the same, since df1's cache already has data
  // before df.unpersist().
  val df1Limit = df1.limit(2)
  val df1LimitInnerPlan = df1Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  assert(df1LimitInnerPlan.isDefined && df1LimitInnerPlan.get == df1InnerPlan)

  // Verify that df2's cache has been re-cached, with a new physical plan rid of the
  // dependency on df, since df2's cache had not been loaded before df.unpersist().
  val df2Limit = df2.limit(2)
  val df2LimitInnerPlan = df2Limit.queryExecution.withCachedData.collectFirst {
    case i: InMemoryRelation => i.cacheBuilder.cachedPlan
  }
  // This assertion is not right:
  assert(df2LimitInnerPlan.isDefined &&
    !df2LimitInnerPlan.get.exists(_.isInstanceOf[InMemoryTableScanExec]))
}
{code}

Since df1 exists in the cache as an InMemoryRelation, and df2 ({{df.filter($"a" > 1)}}) is derivable from the cached df1, the plan for {{val df2Limit = df2.limit(2)}} should utilize the cached df1. The pull request for this is [https://github.com/apache/spark/pull/43854]
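The missed lookup above is about matching a cached plan against a subtree of a new query plan. The general idea can be sketched with a toy plan tree (a hypothetical illustration, far simpler than Spark's CacheManager, which compares canonicalized plans):

```python
class Plan:
    """Toy logical-plan node for illustrating subplan cache lookup."""

    def __init__(self, op, *children):
        self.op = op
        self.children = list(children)

    def key(self):
        # Stands in for Spark's canonicalized-plan equality check.
        return (self.op, tuple(c.key() for c in self.children))


def use_cached(plan, cache):
    """Top-down lookup: if an entire subtree matches a cached entry, swap in
    the cached node; otherwise recurse into the children."""
    if plan.key() in cache:
        return cache[plan.key()]
    return Plan(plan.op, *(use_cached(c, cache) for c in plan.children))
```

With this model, caching a `Project(Scan)` fragment means any later query containing an equivalent `Project(Scan)` subtree picks up the cached node even though the overall plans differ, which is the behavior the ticket says the real lookup misses.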
[jira] [Resolved] (SPARK-47485) Create column with collations in dataframe API
[ https://issues.apache.org/jira/browse/SPARK-47485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-47485. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45569 [https://github.com/apache/spark/pull/45569] > Create column with collations in dataframe API > -- > > Key: SPARK-47485 > URL: https://issues.apache.org/jira/browse/SPARK-47485 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Assignee: Stefan Kandic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Add ability to create string columns with non default collations in the > dataframe API
[jira] [Updated] (SPARK-47617) Add TPC-DS testing infrastructure for collations
[ https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47617: --- Labels: pull-request-available (was: ) > Add TPC-DS testing infrastructure for collations > > > Key: SPARK-47617 > URL: https://issues.apache.org/jira/browse/SPARK-47617 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > Labels: pull-request-available > > As collation support grows across all SQL features and new collation types > are added, we need to have reliable testing model covering as many standard > SQL capabilities as possible. > We can utilize TPC-DS testing infrastructure already present in Spark. The > idea is to vary TPC-DS table string columns by adding multiple collations > with different ordering rules and case sensitivity, producing new tables. > These tables should yield the same results against predefined TPC-DS queries > for certain batches of collations. For example, when comparing query runs on > table where columns are first collated as UTF8_BINARY and then as > UTF8_BINARY_LCASE, we should be getting same results after converting to > lowercase. > Introduce new query suite which tests the described behavior with available > collations (utf8_binary and unicode) combined with case conversions > (lowercase, uppercase, randomized case for fuzzy testing). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47617) Add TPC-DS testing infrastructure for collations
[ https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47617: -- Description: As collation support grows across all SQL features and new collation types are added, we need to have reliable testing model covering as many standard SQL capabilities as possible. We can utilize TPC-DS testing infrastructure already present in Spark. The idea is to vary TPC-DS table string columns by adding multiple collations with different ordering rules and case sensitivity, producing new tables. These tables should yield the same results against predefined TPC-DS queries for certain batches of collations. For example, when comparing query runs on table where columns are first collated as UTF8_BINARY and then as UTF8_BINARY_LCASE, we should be getting same results after converting to lowercase. Introduce new query suite which tests the described behavior with available collations (utf8_binary and unicode) combined with case conversions (lowercase, uppercase, randomized case for fuzzy testing). was: As collation support grows across all SQL features and new collation types are added, we need to have reliable testing model covering as many standard SQL capabilities as possible. We can utilize TCP-DS testing infrastructure already present in Spark. The idea is to vary TCP-DS table string columns by adding multiple collations with different ordering rules and case sensitivity, producing new tables. These tables should yield the same results against predefined TCP-DS queries for certain batches of collations. For example, when comparing query runs on table where columns are first collated as UTF8_BINARY and then as UTF8_BINARY_LCASE, we should be getting same results after converting to lowercase. Introduce new query suite which tests the described behavior with available collations (utf8_binary and unicode) combined with case conversions (lowercase, uppercase, randomized case for fuzzy testing). 
> Add TPC-DS testing infrastructure for collations > > > Key: SPARK-47617 > URL: https://issues.apache.org/jira/browse/SPARK-47617 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > > As collation support grows across all SQL features and new collation types > are added, we need to have reliable testing model covering as many standard > SQL capabilities as possible. > We can utilize TPC-DS testing infrastructure already present in Spark. The > idea is to vary TPC-DS table string columns by adding multiple collations > with different ordering rules and case sensitivity, producing new tables. > These tables should yield the same results against predefined TPC-DS queries > for certain batches of collations. For example, when comparing query runs on > table where columns are first collated as UTF8_BINARY and then as > UTF8_BINARY_LCASE, we should be getting same results after converting to > lowercase. > Introduce new query suite which tests the described behavior with available > collations (utf8_binary and unicode) combined with case conversions > (lowercase, uppercase, randomized case for fuzzy testing).
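The consistency property proposed above can be sketched in miniature (a hypothetical illustration, not the actual suite): a query run under a case-insensitive collation should agree with a binary-collation run once both sides are lowercased.

```python
def run_query_binary(rows):
    """Stand-in for a TPC-DS query over UTF8_BINARY columns: here, simply an
    ORDER BY using binary (codepoint) comparison. Hypothetical sketch."""
    return sorted(rows)

def run_query_lcase(rows):
    """The same 'query' over UTF8_BINARY_LCASE-style columns: ordering that
    ignores case."""
    return sorted(rows, key=str.lower)

# The check: a case-insensitive run should agree with a binary run
# once the data is converted to lowercase.
rows = ["Banana", "apple", "Apple", "BANANA"]
lhs = [r.lower() for r in run_query_lcase(rows)]
rhs = run_query_binary([r.lower() for r in rows])
assert lhs == rhs
```

The real suite applies the same idea across whole TPC-DS tables and queries, with randomized-case variants for fuzz coverage.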
[jira] [Updated] (SPARK-47617) Add TPC-DS testing infrastructure for collations
[ https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47617: -- Summary: Add TPC-DS testing infrastructure for collations (was: Add TCP-DS testing infrastructure for collations) > Add TPC-DS testing infrastructure for collations > > > Key: SPARK-47617 > URL: https://issues.apache.org/jira/browse/SPARK-47617 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > > As collation support grows across all SQL features and new collation types > are added, we need to have reliable testing model covering as many standard > SQL capabilities as possible. > We can utilize TCP-DS testing infrastructure already present in Spark. The > idea is to vary TCP-DS table string columns by adding multiple collations > with different ordering rules and case sensitivity, producing new tables. > These tables should yield the same results against predefined TCP-DS queries > for certain batches of collations. For example, when comparing query runs on > table where columns are first collated as UTF8_BINARY and then as > UTF8_BINARY_LCASE, we should be getting same results after converting to > lowercase. > Introduce new query suite which tests the described behavior with available > collations (utf8_binary and unicode) combined with case conversions > (lowercase, uppercase, randomized case for fuzzy testing).
[jira] [Commented] (SPARK-47413) Substring, Right, Left (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831421#comment-17831421 ] David Milicevic commented on SPARK-47413: - Hey [~gpgp], let's wait for [~uros-db] to confirm; I'm new to the team, so I'm still getting familiar with things. > Substring, Right, Left (all collations) > --- > > Key: SPARK-47413 > URL: https://issues.apache.org/jira/browse/SPARK-47413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *Substring* built-in string function in > Spark (including *Right* and *Left* functions). First confirm what is the > expected behaviour for these functions when given collated strings, then move > on to the implementation that would enable handling strings of all collation > types. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the {*}Substring{*}, > {*}Right{*}, and *Left* functions so that they support all collation types > currently supported in Spark. To understand what changes were introduced in > order to enable full collation support for other existing functions in Spark, > take a look at the Spark PRs and Jira tickets for completed tasks in this > parent (for example: Contains, StartsWith, EndsWith). > > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class.
Also, refer to the Unicode Technical > Standard for > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47617) Add TPC-DS testing infrastructure for collations
Nikola Mandic created SPARK-47617: - Summary: Add TPC-DS testing infrastructure for collations Key: SPARK-47617 URL: https://issues.apache.org/jira/browse/SPARK-47617 Project: Spark Issue Type: Task Components: SQL Affects Versions: 4.0.0 Reporter: Nikola Mandic As collation support grows across all SQL features and new collation types are added, we need a reliable testing model covering as many standard SQL capabilities as possible. We can utilize the TPC-DS testing infrastructure already present in Spark. The idea is to vary TPC-DS table string columns by adding multiple collations with different ordering rules and case sensitivity, producing new tables. These tables should yield the same results against predefined TPC-DS queries for certain batches of collations. For example, when comparing query runs on a table where columns are first collated as UTF8_BINARY and then as UTF8_BINARY_LCASE, we should get the same results after converting to lowercase. Introduce a new query suite which tests the described behavior with available collations (utf8_binary and unicode) combined with case conversions (lowercase, uppercase, randomized case for fuzzy testing). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
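The invariant described above can be sketched in plain Python, with a toy filter standing in for a TPC-DS query; `run_query` and the sample data are hypothetical illustrations, not part of Spark's test infrastructure:

```python
# Minimal sketch of the proposed testing model: a case-sensitive collation
# (UTF8_BINARY) and a case-insensitive one (UTF8_BINARY_LCASE) may disagree
# on raw data, but must agree once the column is converted to lowercase.

def run_query(rows, case_sensitive):
    """Toy stand-in for a TPC-DS query: select rows matching 'music'."""
    if case_sensitive:
        # UTF8_BINARY: binary (case-sensitive) comparison
        return sorted(r for r in rows if r == "music")
    # UTF8_BINARY_LCASE: case-insensitive comparison
    return sorted(r for r in rows if r.lower() == "music")

rows = ["Music", "MUSIC", "music", "Books"]

# The two collations disagree on the raw column...
assert run_query(rows, case_sensitive=True) == ["music"]
assert run_query(rows, case_sensitive=False) == ["MUSIC", "Music", "music"]

# ...but after lowercasing the column they must yield identical results,
# which is the batch-equivalence the proposed query suite would check.
lowered = [r.lower() for r in rows]
assert run_query(lowered, case_sensitive=True) == run_query(lowered, case_sensitive=False)
```

The same idea extends to randomized-case inputs for fuzzy testing: any case permutation of the data, once normalized, should leave the collation batches in agreement.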
[jira] [Commented] (SPARK-47413) Substring, Right, Left (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831393#comment-17831393 ] Gideon P commented on SPARK-47413: -- [~davidm-db] please confirm the expected behavior I have outlined above! > Substring, Right, Left (all collations) > --- > > Key: SPARK-47413 > URL: https://issues.apache.org/jira/browse/SPARK-47413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *Substring* built-in string function in > Spark (including *Right* and *Left* functions). First confirm what is the > expected behaviour for these functions when given collated strings, then move > on to the implementation that would enable handling strings of all collation > types. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the {*}Substring{*}, > {*}Right{*}, and *Left* functions so that they support all collation types > currently supported in Spark. To understand what changes were introduced in > order to enable full collation support for other existing functions in Spark, > take a look at the Spark PRs and Jira tickets for completed tasks in this > parent (for example: Contains, StartsWith, EndsWith). > > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. 
Also, refer to the Unicode Technical > Standard for > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-47413) Substring, Right, Left (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831392#comment-17831392 ] Gideon P edited comment on SPARK-47413 at 3/27/24 2:53 PM: --- > First confirm what is the expected behaviour for these functions when given > collated strings, h1. Collation Handling Summary For the substr, substring, left, and right string manipulation functions in SQL, the explicit or implicit collation of the first parameter is preserved in the function's output. This means that if the input string has a specific collation (whether defined explicitly through a COLLATE expression or implicitly by its source, such as a column's collation), this collation is maintained in the resulting string produced by these functions. The behavior of these functions, apart from collation handling, remains consistent with their standard operation. This includes their handling of starting positions, lengths, and their ability to work with both positive and negative indices for defining substring boundaries. While the `len` parameter (a number) can be passed as a string, its implicit or explicit collation will be thrown away and will not affect the output. Note: unit tests should show that we achieved the following: Collation will be supported for: - STRING columns - STRING expressions - STRING fields in structs h1. Session Collation The third level of collation. Will get back to you about expectations with regard to these four functions. was (Author: JIRAUSER304403): > First confirm what is the expected behaviour for these functions when given > collated strings, # Collation Handling Summary For the substr, substring, left, and right string manipulation functions in SQL, the explicit or implicit collation of the first parameter is preserved in the function's output. 
This means that if the input string has a specific collation (whether defined explicitly through a COLLATE expression or implicitly by its source, such as a column's collation), this collation is maintained in the resulting string produced by these functions. The behavior of these functions, apart from collation handling, remains consistent with their standard operation. This includes their handling of starting positions, lengths, and their ability to work with both positive and negative indices for defining substring boundaries. While `len` parameter (a number) can be a string, it's implicit or explicit collation will be throw away and not effect output. Note: unit tests should show that we achieved the following: Collation will be supported for: ● STRING columns ● STRING expressions ● STRING fields in structs # Session Collation The third level of collation. Will get back to you about expectations with regards to these four functions. > Substring, Right, Left (all collations) > --- > > Key: SPARK-47413 > URL: https://issues.apache.org/jira/browse/SPARK-47413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *Substring* built-in string function in > Spark (including *Right* and *Left* functions). First confirm what is the > expected behaviour for these functions when given collated strings, then move > on to the implementation that would enable handling strings of all collation > types. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. 
In addition, look into the possible use-cases > and implementation of similar functions within other other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the {*}Substring{*}, > {*}Right{*}, and *Left* functions so that they support all collation types > currently supported in Spark. To understand what changes were introduced in > order to enable full collation support for other existing functions in Spark, > take a look at the Spark PRs and Jira tickets for completed tasks in this > parent (for example: Contains, StartsWith, EndsWith). > > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To
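The rule outlined in the comment above (the result carries the first argument's collation, while any collation on the length argument is discarded) can be sketched in plain Python; `CollatedString` and its methods are hypothetical illustrations under those assumptions, not Spark's implementation:

```python
from dataclasses import dataclass

@dataclass
class CollatedString:
    value: str
    collation: str = "UTF8_BINARY"

    def substring(self, pos: int, length: int) -> "CollatedString":
        # SQL substring is 1-based; a negative pos counts from the end.
        start = pos - 1 if pos > 0 else len(self.value) + pos
        # The result inherits the collation of the input string.
        return CollatedString(self.value[start:start + length], self.collation)

    def left(self, n: int) -> "CollatedString":
        return CollatedString(self.value[:n], self.collation)

    def right(self, n: int) -> "CollatedString":
        return CollatedString(self.value[-n:] if n > 0 else "", self.collation)

s = CollatedString("Spark", "UNICODE_CI")
assert s.substring(2, 3).value == "par"
assert s.substring(2, 3).collation == "UNICODE_CI"  # input collation preserved
assert s.left(2).value == "Sp"
assert s.right(3).collation == "UNICODE_CI"
```

Whether the collation comes from an explicit COLLATE expression or implicitly from a column, the sketch treats it the same way: it simply travels with the first argument into the result.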
[jira] [Commented] (SPARK-47413) Substring, Right, Left (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831392#comment-17831392 ] Gideon P commented on SPARK-47413: -- > First confirm what is the expected behaviour for these functions when given > collated strings, # Collation Handling Summary For the substr, substring, left, and right string manipulation functions in SQL, the explicit or implicit collation of the first parameter is preserved in the function's output. This means that if the input string has a specific collation (whether defined explicitly through a COLLATE expression or implicitly by its source, such as a column's collation), this collation is maintained in the resulting string produced by these functions. The behavior of these functions, apart from collation handling, remains consistent with their standard operation. This includes their handling of starting positions, lengths, and their ability to work with both positive and negative indices for defining substring boundaries. While the `len` parameter (a number) can be passed as a string, its implicit or explicit collation will be thrown away and will not affect the output. Note: unit tests should show that we achieved the following: Collation will be supported for: ● STRING columns ● STRING expressions ● STRING fields in structs # Session Collation The third level of collation. Will get back to you about expectations with regard to these four functions. > Substring, Right, Left (all collations) > --- > > Key: SPARK-47413 > URL: https://issues.apache.org/jira/browse/SPARK-47413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *Substring* built-in string function in > Spark (including *Right* and *Left* functions). 
First confirm what is the > expected behaviour for these functions when given collated strings, then move > on to the implementation that would enable handling strings of all collation > types. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the {*}Substring{*}, > {*}Right{*}, and *Left* functions so that they support all collation types > currently supported in Spark. To understand what changes were introduced in > order to enable full collation support for other existing functions in Spark, > take a look at the Spark PRs and Jira tickets for completed tasks in this > parent (for example: Contains, StartsWith, EndsWith). > > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47616) Document Mapping Spark SQL Data Types from MySQL
[ https://issues.apache.org/jira/browse/SPARK-47616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47616. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45736 [https://github.com/apache/spark/pull/45736] > Document Mapping Spark SQL Data Types from MySQL > > > Key: SPARK-47616 > URL: https://issues.apache.org/jira/browse/SPARK-47616 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47616) Document Mapping Spark SQL Data Types from MySQL
[ https://issues.apache.org/jira/browse/SPARK-47616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47616: - Assignee: Kent Yao > Document Mapping Spark SQL Data Types from MySQL > > > Key: SPARK-47616 > URL: https://issues.apache.org/jira/browse/SPARK-47616 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47413) Substring, Right, Left (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47413: --- Labels: pull-request-available (was: ) > Substring, Right, Left (all collations) > --- > > Key: SPARK-47413 > URL: https://issues.apache.org/jira/browse/SPARK-47413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *Substring* built-in string function in > Spark (including *Right* and *Left* functions). First confirm what is the > expected behaviour for these functions when given collated strings, then move > on to the implementation that would enable handling strings of all collation > types. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the {*}Substring{*}, > {*}Right{*}, and *Left* functions so that they support all collation types > currently supported in Spark. To understand what changes were introduced in > order to enable full collation support for other existing functions in Spark, > take a look at the Spark PRs and Jira tickets for completed tasks in this > parent (for example: Contains, StartsWith, EndsWith). > > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47564) always throw FAILED_READ_FILE error when fail to read files
[ https://issues.apache.org/jira/browse/SPARK-47564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-47564: Assignee: Wenchen Fan > always throw FAILED_READ_FILE error when fail to read files > --- > > Key: SPARK-47564 > URL: https://issues.apache.org/jira/browse/SPARK-47564 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47564) always throw FAILED_READ_FILE error when fail to read files
[ https://issues.apache.org/jira/browse/SPARK-47564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-47564. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45723 [https://github.com/apache/spark/pull/45723] > always throw FAILED_READ_FILE error when fail to read files > --- > > Key: SPARK-47564 > URL: https://issues.apache.org/jira/browse/SPARK-47564 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47563) Map normalization upon creation
[ https://issues.apache.org/jira/browse/SPARK-47563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47563. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45721 [https://github.com/apache/spark/pull/45721] > Map normalization upon creation > --- > > Key: SPARK-47563 > URL: https://issues.apache.org/jira/browse/SPARK-47563 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Stevo Mitric >Assignee: Stevo Mitric >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Add handling of map normalization upon creation in ArrayBasedMapBuilder. > Currently a map with keys 0.0 and -0.0 will behave as if they are separate > values. This will cause issues when doing GROUP BY on map types. > Refer to this conversation > [https://github.com/apache/spark/pull/45549#discussion_r1537803505] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
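The kind of key normalization this ticket describes can be sketched in plain Python (this is an illustrative stand-in, not Spark's ArrayBasedMapBuilder; the class and function names are hypothetical). Under IEEE 754, -0.0 compares equal to 0.0 but has a different bit pattern, so a builder can canonicalize zero keys on insertion:

```python
import math

def normalize_key(key):
    # -0.0 == 0.0 under IEEE 754 comparison, so this branch also catches
    # negative zero and canonicalizes it to +0.0.
    if isinstance(key, float) and key == 0.0:
        return 0.0
    return key

class NormalizingMapBuilder:
    """Illustrative builder that normalizes keys before storing entries."""
    def __init__(self):
        self.entries = {}

    def put(self, key, value):
        self.entries[normalize_key(key)] = value

# Without normalization, whichever zero arrives first becomes the stored key
# (CPython dicts keep the first-inserted key object for equal keys):
plain = {}
plain[-0.0] = "a"
plain[0.0] = "b"
assert math.copysign(1.0, next(iter(plain))) == -1.0  # stored key is still -0.0

# With normalization, the stored key is always +0.0, regardless of order:
b = NormalizingMapBuilder()
b.put(-0.0, "a")
b.put(0.0, "b")
assert len(b.entries) == 1
assert math.copysign(1.0, next(iter(b.entries))) == 1.0
```

Canonicalizing on creation means any later operation that compares maps by their stored representation, such as GROUP BY, sees a single deterministic key for zero.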
[jira] [Assigned] (SPARK-47563) Map normalization upon creation
[ https://issues.apache.org/jira/browse/SPARK-47563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-47563: --- Assignee: Stevo Mitric > Map normalization upon creation > --- > > Key: SPARK-47563 > URL: https://issues.apache.org/jira/browse/SPARK-47563 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Stevo Mitric >Assignee: Stevo Mitric >Priority: Major > Labels: pull-request-available > > Add handling of map normalization upon creation in ArrayBasedMapBuilder. > Currently a map with keys 0.0 and -0.0 will behave as if they are separate > values. This will cause issues when doing GROUP BY on map types. > Refer to this conversation > [https://github.com/apache/spark/pull/45549#discussion_r1537803505] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47489) Action performed on Dataset is triggering an error as "The Spark SQL phase planning failed with an internal error."
[ https://issues.apache.org/jira/browse/SPARK-47489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831288#comment-17831288 ] Jagadeesh Marada commented on SPARK-47489: -- I have added the spark driver log with TRACE log level for your better understanding of this issue. > Action performed on Dataset is triggering an error as "The Spark SQL phase > planning failed with an internal error." > --- > > Key: SPARK-47489 > URL: https://issues.apache.org/jira/browse/SPARK-47489 > Project: Spark > Issue Type: Bug > Components: k8s, Spark Core, Spark Submit >Affects Versions: 3.4.2 >Reporter: Jagadeesh Marada >Priority: Major > Attachments: code snippet.jpeg, exception_stack_trace.txt, jars added > to spark_submit.jpeg, landing-table-update-1711534916824-driver_TRACE.txt, > spark-submit cmd.txt > > > Hi team, > We are upgrading to spark 3.4.2 from 3.3.2 and facing issue in the spark > driver as below. > "org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase > planning failed with an internal error. You hit a bug in Spark or the Spark > plugins you use. Please, report this bug to the corresponding communities or > vendors, and provide the full stack trace." > > Complete stack trace is as attached. > After analysing further , found that when we try to perform any actions on > the dataset, spark is unable to plan its actions, results in this exception. > Attaching a Exception stack trace & code snippet where exactly we are > trigging an action to convert it as java RDD. > > Attaching the spark submit command as well for your reference. > Can you please check this issue let us know if any fix available for this > issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47489) Action performed on Dataset is triggering an error as "The Spark SQL phase planning failed with an internal error."
[ https://issues.apache.org/jira/browse/SPARK-47489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jagadeesh Marada updated SPARK-47489: - Attachment: (was: landing-table-update-1710866680486-driver.txt) > Action performed on Dataset is triggering an error as "The Spark SQL phase > planning failed with an internal error." > --- > > Key: SPARK-47489 > URL: https://issues.apache.org/jira/browse/SPARK-47489 > Project: Spark > Issue Type: Bug > Components: k8s, Spark Core, Spark Submit >Affects Versions: 3.4.2 >Reporter: Jagadeesh Marada >Priority: Major > Attachments: code snippet.jpeg, exception_stack_trace.txt, jars added > to spark_submit.jpeg, landing-table-update-1711534916824-driver_TRACE.txt, > spark-submit cmd.txt > > > Hi team, > We are upgrading to spark 3.4.2 from 3.3.2 and facing issue in the spark > driver as below. > "org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase > planning failed with an internal error. You hit a bug in Spark or the Spark > plugins you use. Please, report this bug to the corresponding communities or > vendors, and provide the full stack trace." > > Complete stack trace is as attached. > After analysing further , found that when we try to perform any actions on > the dataset, spark is unable to plan its actions, results in this exception. > Attaching a Exception stack trace & code snippet where exactly we are > trigging an action to convert it as java RDD. > > Attaching the spark submit command as well for your reference. > Can you please check this issue let us know if any fix available for this > issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47489) Action performed on Dataset is triggering an error as "The Spark SQL phase planning failed with an internal error."
[ https://issues.apache.org/jira/browse/SPARK-47489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jagadeesh Marada updated SPARK-47489: - Attachment: landing-table-update-1711534916824-driver_TRACE.txt > Action performed on Dataset is triggering an error as "The Spark SQL phase > planning failed with an internal error." > --- > > Key: SPARK-47489 > URL: https://issues.apache.org/jira/browse/SPARK-47489 > Project: Spark > Issue Type: Bug > Components: k8s, Spark Core, Spark Submit >Affects Versions: 3.4.2 >Reporter: Jagadeesh Marada >Priority: Major > Attachments: code snippet.jpeg, exception_stack_trace.txt, jars added > to spark_submit.jpeg, landing-table-update-1710866680486-driver.txt, > landing-table-update-1711534916824-driver_TRACE.txt, spark-submit cmd.txt > > > Hi team, > We are upgrading to spark 3.4.2 from 3.3.2 and facing issue in the spark > driver as below. > "org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase > planning failed with an internal error. You hit a bug in Spark or the Spark > plugins you use. Please, report this bug to the corresponding communities or > vendors, and provide the full stack trace." > > Complete stack trace is as attached. > After analysing further , found that when we try to perform any actions on > the dataset, spark is unable to plan its actions, results in this exception. > Attaching a Exception stack trace & code snippet where exactly we are > trigging an action to convert it as java RDD. > > Attaching the spark submit command as well for your reference. > Can you please check this issue let us know if any fix available for this > issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47489) Action performed on Dataset is triggering an error as "The Spark SQL phase planning failed with an internal error."
[ https://issues.apache.org/jira/browse/SPARK-47489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jagadeesh Marada updated SPARK-47489: - Attachment: landing-table-update-1710866680486-driver.txt > Action performed on Dataset is triggering an error as "The Spark SQL phase > planning failed with an internal error." > --- > > Key: SPARK-47489 > URL: https://issues.apache.org/jira/browse/SPARK-47489 > Project: Spark > Issue Type: Bug > Components: k8s, Spark Core, Spark Submit >Affects Versions: 3.4.2 >Reporter: Jagadeesh Marada >Priority: Major > Attachments: code snippet.jpeg, exception_stack_trace.txt, jars added > to spark_submit.jpeg, landing-table-update-1710866680486-driver.txt, > spark-submit cmd.txt > > > Hi team, > We are upgrading to spark 3.4.2 from 3.3.2 and facing issue in the spark > driver as below. > "org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase > planning failed with an internal error. You hit a bug in Spark or the Spark > plugins you use. Please, report this bug to the corresponding communities or > vendors, and provide the full stack trace." > > Complete stack trace is as attached. > After analysing further , found that when we try to perform any actions on > the dataset, spark is unable to plan its actions, results in this exception. > Attaching a Exception stack trace & code snippet where exactly we are > trigging an action to convert it as java RDD. > > Attaching the spark submit command as well for your reference. > Can you please check this issue let us know if any fix available for this > issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47616) Document Mapping Spark SQL Data Types from MySQL
[ https://issues.apache.org/jira/browse/SPARK-47616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47616: --- Labels: pull-request-available (was: ) > Document Mapping Spark SQL Data Types from MySQL > > > Key: SPARK-47616 > URL: https://issues.apache.org/jira/browse/SPARK-47616 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47616) Document Mapping Spark SQL Data Types from MySQL
Kent Yao created SPARK-47616: Summary: Document Mapping Spark SQL Data Types from MySQL Key: SPARK-47616 URL: https://issues.apache.org/jira/browse/SPARK-47616 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47615) Aggregate + First() Function - ArrayIndexOutOfBoundsException - ColumnPruning?
Frederik Schreiber created SPARK-47615: -- Summary: Aggregate + First() Function - ArrayIndexOutOfBoundsException - ColumnPruning? Key: SPARK-47615 URL: https://issues.apache.org/jira/browse/SPARK-47615 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 3.5.0, 3.4.1 Environment: Amazon EMR version emr-7.0.0 Installed applications Tez 0.10.2, Spark 3.5.0 Amazon Linux release 2023.3.20240312.0 1 Master Node m6g.xlarge 2 Core Nodes m6g.2xlarge Reporter: Frederik Schreiber Currently I'm investigating an upgrade of our code base from Spark 3.3.0 to 3.5.0 (embedded in a dedicated AWS EMR cluster). I get the following exception when I execute my code on the cluster; if I run local unit tests, the code runs as expected without exception. {code:java} 24/03/26 19:32:19 INFO RecordServerQueryListener: Cleaning up temp directory - /user/KKQI7VHKTMNQZJQNMMZXKH5KYNRPOHXG/application_1711468652551_0023 Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 186) (ip-10-1-1-6.eu-central-1.compute.internal executor 2): java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 3 at org.apache.spark.sql.vectorized.ColumnarBatch.column(ColumnarBatch.java:95) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnaraggregatetorow_parquetMax_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnaraggregatetorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:142) at org.apache.spark.shuffle.ShuffleWriteProcessor.doWrite(ShuffleWriteProcessor.scala:45) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:143) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3067) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3003) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3002) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3002) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1318) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1318) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1318) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3271) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3205) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3194) at
[jira] [Resolved] (SPARK-47611) Cleanup dead code in MySQLDialect.getCatalystType
[ https://issues.apache.org/jira/browse/SPARK-47611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47611. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45734 [https://github.com/apache/spark/pull/45734] > Cleanup dead code in MySQLDialect.getCatalystType > - > > Key: SPARK-47611 > URL: https://issues.apache.org/jira/browse/SPARK-47611 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47491) Re-enable `driver log links` test in YarnClusterSuite
[ https://issues.apache.org/jira/browse/SPARK-47491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47491. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45618 [https://github.com/apache/spark/pull/45618] > Re-enable `driver log links` test in YarnClusterSuite > - > > Key: SPARK-47491 > URL: https://issues.apache.org/jira/browse/SPARK-47491 > Project: Spark > Issue Type: Sub-task > Components: Tests, YARN >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47614) Rename `JavaModuleOptions` to `JVMRuntimeOptions`
[ https://issues.apache.org/jira/browse/SPARK-47614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47614: --- Labels: pull-request-available (was: ) > Rename `JavaModuleOptions` to `JVMRuntimeOptions` > - > > Key: SPARK-47614 > URL: https://issues.apache.org/jira/browse/SPARK-47614 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47614) Rename `JavaModuleOptions` to `JVMRuntimeOptions`
BingKun Pan created SPARK-47614: --- Summary: Rename `JavaModuleOptions` to `JVMRuntimeOptions` Key: SPARK-47614 URL: https://issues.apache.org/jira/browse/SPARK-47614 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39983) Should not cache unserialized broadcast relations on the driver
[ https://issues.apache.org/jira/browse/SPARK-39983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-39983: --- Labels: pull-request-available (was: ) > Should not cache unserialized broadcast relations on the driver > --- > > Key: SPARK-39983 > URL: https://issues.apache.org/jira/browse/SPARK-39983 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Alex Balikov >Assignee: Alex Balikov >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > In TorrentBroadcast.writeBlocks we store the unserialized broadcast object in > addition to the serialized version of it - > {code:java} > private def writeBlocks(value: T): Int = { > import StorageLevel._ > // Store a copy of the broadcast variable in the driver so that tasks run > on the driver > // do not create a duplicate copy of the broadcast variable's value. > val blockManager = SparkEnv.get.blockManager > if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, > tellMaster = false)) { > throw new SparkException(s"Failed to store $broadcastId in > BlockManager") > } > {code} > In case of broadcast relations, these objects can be fairly large (60MB in > one observed case) and are not strictly necessary on the driver. > Add the option to not keep the unserialized versions of the objects. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
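The driver-side tradeoff described in SPARK-39983 can be modeled in a toy, Spark-independent sketch: caching only the serialized bytes on the driver trades a deserialization cost on (rare) driver-side reads for not holding a second live copy of a large object. This is illustrative only — the class, flag, and use of pickle are invented for the sketch; the real change lives in TorrentBroadcast's Scala code.

```python
import pickle

class ToyBroadcast:
    """Toy model of the proposal: always keep the serialized form,
    optionally skip the live (unserialized) driver-side copy."""

    def __init__(self, value, keep_unserialized=False):
        # Serialized bytes are always kept -- this is what ships to executors.
        self._blob = pickle.dumps(value)
        # Current behaviour also keeps the live object on the driver;
        # the proposal makes that optional.
        self._value = value if keep_unserialized else None

    def value(self):
        # Driver-side read: cached live object if present, else deserialize.
        if self._value is not None:
            return self._value
        return pickle.loads(self._blob)

big = list(range(100_000))
lean = ToyBroadcast(big)                          # proposed: one live copy fewer
fat = ToyBroadcast(big, keep_unserialized=True)   # current behaviour
assert lean.value() == fat.value() == big
```

The design choice is visible in `value()`: the lean variant pays `pickle.loads` on each driver-side read, which is acceptable when such reads are rare relative to the memory saved.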
[jira] [Created] (SPARK-47613) Issue with Spark Connect on Python 3.12
Kai-Michael Roesner created SPARK-47613: --- Summary: Issue with Spark Connect on Python 3.12 Key: SPARK-47613 URL: https://issues.apache.org/jira/browse/SPARK-47613 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.5.0, 3.4.1 Reporter: Kai-Michael Roesner When trying to create a remote Spark session with PySpark on Python 3.12, a {{ModuleNotFoundError: No module named 'distutils'}} exception is thrown. In Python 3.12, {{distutils}} was removed from the stdlib. As a workaround, we can {{import setuptools}} before creating the session. See also [this question on Stack Overflow|https://stackoverflow.com/questions/78207291] and the [answer|https://stackoverflow.com/a/78212125/11474852] by Anderson Bravalheri. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
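The workaround from the report can be sketched as below. The Spark Connect URL is a placeholder, and the sketch assumes setuptools is installed (it provides a vendored distutils shim once imported):

```python
# Workaround for Python 3.12, where distutils was removed from the
# stdlib (PEP 632). Importing setuptools first installs its vendored
# distutils shim, so PySpark's internal `import distutils` succeeds.
import setuptools  # noqa: F401 -- must run before any pyspark import

# With the shim in place, distutils is importable again:
import distutils   # noqa: F401

# Creating the remote session then proceeds as usual (requires the
# pyspark package and a running Spark Connect server; URL is illustrative):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
```

Note the import order matters: `setuptools` must be imported before any module that tries `import distutils`.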
[jira] [Updated] (SPARK-47612) Improve picking the side of partially clustered distribution according to partition size
[ https://issues.apache.org/jira/browse/SPARK-47612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated SPARK-47612: --- Description: Currently we pick the side of partially clustered distribution as follows: SPJ relies on a simple heuristic and always picks the side with less data size based on table statistics as the fully clustered side, even though it could also contain skewed partitions. We can potentially do a fine-grained comparison based on partition values, since we have that information now. was: Now we pick up the side of partially clustered distribution: Using plan statistics to determine which side of join to fully cluster partition values. We can optimize to use partition size since we have the information now. > Improve picking the side of partially clustered distribution according to > partition size > > > Key: SPARK-47612 > URL: https://issues.apache.org/jira/browse/SPARK-47612 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Qi Zhu >Priority: Major > > Currently we pick the side of partially clustered distribution as follows: > SPJ relies on a simple heuristic and always picks the side with less > data size based on table statistics as the fully clustered side, even though > it could also contain skewed partitions. > We can potentially do a fine-grained comparison based on partition values, > since we have that information now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
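The difference between the two heuristics can be sketched in a toy form. This is illustrative only — not Spark's actual storage-partitioned-join code; the function names and the max-partition criterion are assumptions about what a finer-grained comparison might look like:

```python
# Toy sketch of the two side-picking heuristics described in SPARK-47612.

def pick_side_by_table_stats(left_bytes, right_bytes):
    """Current heuristic: compare whole-table sizes only.
    Blind to skew -- a small table with one huge partition still wins."""
    return "left" if left_bytes <= right_bytes else "right"

def pick_side_by_partition_sizes(left_partition_bytes, right_partition_bytes):
    """Sketch of a finer-grained heuristic: compare the largest
    per-partition size, so a skewed partition disqualifies its side."""
    return ("left"
            if max(left_partition_bytes) <= max(right_partition_bytes)
            else "right")

# Left table is smaller overall (100 vs 120) but has a skewed 90-byte
# partition; the partition-aware heuristic picks the other side.
left_parts, right_parts = [90, 5, 5], [40, 40, 40]
print(pick_side_by_table_stats(sum(left_parts), sum(right_parts)))   # left
print(pick_side_by_partition_sizes(left_parts, right_parts))         # right
```

The example shows why the issue matters: table-level statistics and per-partition sizes can disagree exactly when one side is skewed.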