[jira] [Assigned] (SPARK-47210) Implicit casting on collated expressions
[ https://issues.apache.org/jira/browse/SPARK-47210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47210: -- Assignee: Apache Spark > Implicit casting on collated expressions > > > Key: SPARK-47210 > URL: https://issues.apache.org/jira/browse/SPARK-47210 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > *What changes were proposed in this pull request?* > This PR adds automatic casting and collations resolution as per `PGSQL` > behaviour: > 1. Collations set on the metadata level are implicit > 2. Collations set using the `COLLATE` expression are explicit > 3. When there is a combination of expressions of multiple collations the > output will be: > - if there are explicit collations and all of them are equal then that > collation will be the output > - if there are multiple different explicit collations > `COLLATION_MISMATCH.EXPLICIT` will be thrown > - if there are no explicit collations and only a single type of non default > collation, that one will be used > - if there are no explicit collations and multiple non-default implicit ones > `COLLATION_MISMATCH.IMPLICIT` will be thrown > Another thing is that `INDETERMINATE_COLLATION` should only be thrown on > comparison operations, and we should be able to combine different implicit > collations for certain operations like concat and possible others in the > future. > This is why I had to add another predefined collation id named > `INDETERMINATE_COLLATION_ID` which means that the result is a combination of > conflicting non-default implicit collations. Right now it has an id of -1 so > it fails if it ever goes to the `CollatorFactory`. > *Why are the changes needed?* > We need to be able to compare columns and values with different collations > and set a way of explicitly changing the collation we want to use. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
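The precedence rules listed in the SPARK-47210 description above can be modelled as a small resolution function. The sketch below is purely illustrative and is not Spark's implementation; the collation names and the `UTF8_BINARY` default are assumptions made for the example.

```python
# Hypothetical model of the collation-resolution precedence described above
# (not Spark's actual code). Each operand carries a collation name and a flag
# saying whether the collation was set explicitly via COLLATE.
DEFAULT_COLLATION = "UTF8_BINARY"  # assumed default collation name

def resolve_collation(operands):
    """operands: list of (collation_name, is_explicit) tuples."""
    explicit = {name for name, is_explicit in operands if is_explicit}
    if explicit:
        if len(explicit) > 1:
            # Multiple different explicit collations: error.
            raise ValueError("COLLATION_MISMATCH.EXPLICIT")
        return explicit.pop()  # all explicit collations agree
    # No explicit collations: look at non-default implicit ones.
    implicit = {name for name, _ in operands if name != DEFAULT_COLLATION}
    if len(implicit) > 1:
        raise ValueError("COLLATION_MISMATCH.IMPLICIT")
    return implicit.pop() if implicit else DEFAULT_COLLATION
```

For example, combining a column collated `UNICODE_CI` (implicit, from metadata) with a default-collated literal would resolve to `UNICODE_CI`, while two conflicting explicit COLLATE clauses would raise the explicit-mismatch error.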
[jira] [Assigned] (SPARK-47210) Implicit casting on collated expressions
[ https://issues.apache.org/jira/browse/SPARK-47210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47210: -- Assignee: (was: Apache Spark) > Implicit casting on collated expressions > > > Key: SPARK-47210 > URL: https://issues.apache.org/jira/browse/SPARK-47210 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > > *What changes were proposed in this pull request?* > This PR adds automatic casting and collations resolution as per `PGSQL` > behaviour: > 1. Collations set on the metadata level are implicit > 2. Collations set using the `COLLATE` expression are explicit > 3. When there is a combination of expressions of multiple collations the > output will be: > - if there are explicit collations and all of them are equal then that > collation will be the output > - if there are multiple different explicit collations > `COLLATION_MISMATCH.EXPLICIT` will be thrown > - if there are no explicit collations and only a single type of non default > collation, that one will be used > - if there are no explicit collations and multiple non-default implicit ones > `COLLATION_MISMATCH.IMPLICIT` will be thrown > Another thing is that `INDETERMINATE_COLLATION` should only be thrown on > comparison operations, and we should be able to combine different implicit > collations for certain operations like concat and possible others in the > future. > This is why I had to add another predefined collation id named > `INDETERMINATE_COLLATION_ID` which means that the result is a combination of > conflicting non-default implicit collations. Right now it has an id of -1 so > it fails if it ever goes to the `CollatorFactory`. > *Why are the changes needed?* > We need to be able to compare columns and values with different collations > and set a way of explicitly changing the collation we want to use. 
[jira] [Updated] (SPARK-47621) Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean`
[ https://issues.apache.org/jira/browse/SPARK-47621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47621: --- Labels: pull-request-available (was: ) > Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean` > -- > > Key: SPARK-47621 > URL: https://issues.apache.org/jira/browse/SPARK-47621 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-47625) Addition of Indeterminate Collation Support
Mihailo Milosevic created SPARK-47625: - Summary: Addition of Indeterminate Collation Support Key: SPARK-47625 URL: https://issues.apache.org/jira/browse/SPARK-47625 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Mihailo Milosevic
[jira] [Created] (SPARK-47626) Addition for Map Implicit Casting of Collated Strings
Mihailo Milosevic created SPARK-47626: - Summary: Addition for Map Implicit Casting of Collated Strings Key: SPARK-47626 URL: https://issues.apache.org/jira/browse/SPARK-47626 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Mihailo Milosevic
[jira] [Updated] (SPARK-47625) Addition of Indeterminate Collation Support
[ https://issues.apache.org/jira/browse/SPARK-47625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-47625: -- Description: {{INDETERMINATE_COLLATION}} should only be thrown on comparison operations and when storing data, and we should be able to combine different implicit collations for certain operations like concat and possibly others in the future. This is why we have to add another predefined collation id named {{INDETERMINATE_COLLATION_ID}}, which means that the result is a combination of conflicting non-default implicit collations. Right now it would have an id of -1, so it would fail if it ever reaches the {{CollatorFactory}}. > Addition of Indeterminate Collation Support > --- > > Key: SPARK-47625 > URL: https://issues.apache.org/jira/browse/SPARK-47625 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > > {{INDETERMINATE_COLLATION}} should only be thrown on comparison operations > and when storing data, and we should be able to combine different > implicit collations for certain operations like concat and possibly others in > the future. > This is why we have to add another predefined collation id named > {{INDETERMINATE_COLLATION_ID}}, which means that the result is a combination > of conflicting non-default implicit collations. Right now it would have an id > of -1, so it would fail if it ever reaches the {{CollatorFactory}}.
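The intended behaviour described above, where conflicting implicit collations may flow through combinable operations such as concat but must fail on comparison, can be sketched as follows. This is a hypothetical model only; the `-1` sentinel mirrors the ticket description, and the function names are invented for illustration.

```python
INDETERMINATE_COLLATION_ID = -1     # sentinel from the ticket description
DEFAULT_COLLATION = "UTF8_BINARY"   # assumed default collation name

def combine_for_concat(collations):
    """Combine implicit collations for a combinable op such as concat."""
    non_default = {c for c in collations if c != DEFAULT_COLLATION}
    if len(non_default) > 1:
        # Conflicting implicit collations: the result collation is
        # indeterminate, but the operation itself is still allowed.
        return INDETERMINATE_COLLATION_ID
    return non_default.pop() if non_default else DEFAULT_COLLATION

def require_determinate(collation):
    """A comparison (or storing data) must reject an indeterminate result."""
    if collation == INDETERMINATE_COLLATION_ID:
        raise ValueError("INDETERMINATE_COLLATION")
    return collation
```

The key design point is that indeterminacy is deferred: it only surfaces as an error at the operations that actually need a single well-defined collation.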
[jira] [Updated] (SPARK-47409) StringTrim & StringTrimLeft/Right/Both (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47409: --- Labels: pull-request-available (was: ) > StringTrim & StringTrimLeft/Right/Both (all collations) > --- > > Key: SPARK-47409 > URL: https://issues.apache.org/jira/browse/SPARK-47409 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *StringTrim* built-in string function in > Spark (including {*}StringTrimBoth{*}, {*}StringTrimLeft{*}, > {*}StringTrimRight{*}). First confirm what the expected behaviour is for > these functions when given collated strings, and then move on to > implementation and testing. One way to go about this is to consider using > {_}StringSearch{_}, an efficient ICU service for string matching. Implement > the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests > (CollationSuite) to reflect how this function should be used with collation > in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment > with the existing functions to learn more about how they work. In addition, > look into the possible use-cases and implementation of similar functions > within other open-source DBMSs, such as > [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringTrim* function so it > supports all collation types currently supported in Spark. To understand what > changes were introduced in order to enable full collation support for other > existing functions in Spark, take a look at the Spark PRs and Jira tickets > for completed tasks in this parent (for example: Contains, StartsWith, > EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
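To illustrate why trim needs collation awareness at all, the sketch below shows a left-trim whose trim-set membership can be case-insensitive, roughly what a lowercase-insensitive collation would imply. It is a stand-in using Python's `casefold`, not the ICU StringSearch approach the ticket suggests, and the function name is hypothetical.

```python
def collated_ltrim(s: str, trim_chars: str, case_insensitive: bool = False) -> str:
    """Left-trim where membership in the trim set may ignore case."""
    # Under a binary collation, characters must match exactly; under a
    # case-insensitive one, 'A' matches a trim set containing 'a'.
    fold = str.casefold if case_insensitive else (lambda ch: ch)
    trim_set = {fold(ch) for ch in trim_chars}
    i = 0
    while i < len(s) and fold(s[i]) in trim_set:
        i += 1
    return s[i:]
```

Under a binary collation `collated_ltrim("AAAbc", "a")` leaves the string untouched, while under the case-insensitive variant the leading run is stripped. The real implementation would have to make the same distinction per collation type.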
[jira] [Updated] (SPARK-47626) Addition for Map Implicit Casting of Collated Strings
[ https://issues.apache.org/jira/browse/SPARK-47626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-47626: -- Description: The initial ticket for collation implicit casting, SPARK-47210, introduced support for casting of arrays and normal string types. This ticket needs to dive into the problem of casting MapType. (was: Initial PR for addition of collation implicit casting [SPARK-47210] introduced support for casting of arrays and normal string types.) > Addition for Map Implicit Casting of Collated Strings > - > > Key: SPARK-47626 > URL: https://issues.apache.org/jira/browse/SPARK-47626 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > > The initial ticket for collation implicit casting, SPARK-47210, > introduced support for casting of arrays and normal string types. This ticket > needs to dive into the problem of casting MapType.
[jira] [Updated] (SPARK-47210) Implicit casting on collated expressions
[ https://issues.apache.org/jira/browse/SPARK-47210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-47210: -- Epic Link: (was: SPARK-46830) > Implicit casting on collated expressions > > > Key: SPARK-47210 > URL: https://issues.apache.org/jira/browse/SPARK-47210 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > > *What changes were proposed in this pull request?* > This PR adds automatic casting and collations resolution as per `PGSQL` > behaviour: > 1. Collations set on the metadata level are implicit > 2. Collations set using the `COLLATE` expression are explicit > 3. When there is a combination of expressions of multiple collations the > output will be: > - if there are explicit collations and all of them are equal then that > collation will be the output > - if there are multiple different explicit collations > `COLLATION_MISMATCH.EXPLICIT` will be thrown > - if there are no explicit collations and only a single type of non default > collation, that one will be used > - if there are no explicit collations and multiple non-default implicit ones > `COLLATION_MISMATCH.IMPLICIT` will be thrown > Another thing is that `INDETERMINATE_COLLATION` should only be thrown on > comparison operations, and we should be able to combine different implicit > collations for certain operations like concat and possible others in the > future. > This is why I had to add another predefined collation id named > `INDETERMINATE_COLLATION_ID` which means that the result is a combination of > conflicting non-default implicit collations. Right now it has an id of -1 so > it fails if it ever goes to the `CollatorFactory`. > *Why are the changes needed?* > We need to be able to compare columns and values with different collations > and set a way of explicitly changing the collation we want to use. 
[jira] [Updated] (SPARK-47210) Addition of implicit casting without indeterminate support
[ https://issues.apache.org/jira/browse/SPARK-47210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-47210: -- Summary: Addition of implicit casting without indeterminate support (was: Implicit casting on collated expressions) > Addition of implicit casting without indeterminate support > -- > > Key: SPARK-47210 > URL: https://issues.apache.org/jira/browse/SPARK-47210 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > > *What changes were proposed in this pull request?* > This PR adds automatic casting and collations resolution as per `PGSQL` > behaviour: > 1. Collations set on the metadata level are implicit > 2. Collations set using the `COLLATE` expression are explicit > 3. When there is a combination of expressions of multiple collations the > output will be: > - if there are explicit collations and all of them are equal then that > collation will be the output > - if there are multiple different explicit collations > `COLLATION_MISMATCH.EXPLICIT` will be thrown > - if there are no explicit collations and only a single type of non default > collation, that one will be used > - if there are no explicit collations and multiple non-default implicit ones > `COLLATION_MISMATCH.IMPLICIT` will be thrown > Another thing is that `INDETERMINATE_COLLATION` should only be thrown on > comparison operations, and we should be able to combine different implicit > collations for certain operations like concat and possible others in the > future. > This is why I had to add another predefined collation id named > `INDETERMINATE_COLLATION_ID` which means that the result is a combination of > conflicting non-default implicit collations. Right now it has an id of -1 so > it fails if it ever goes to the `CollatorFactory`. 
> *Why are the changes needed?* > We need to be able to compare columns and values with different collations > and set a way of explicitly changing the collation we want to use.
[jira] [Updated] (SPARK-47210) Implicit casting on collated expressions
[ https://issues.apache.org/jira/browse/SPARK-47210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-47210: -- Parent: SPARK-47624 Issue Type: Sub-task (was: Improvement) > Implicit casting on collated expressions > > > Key: SPARK-47210 > URL: https://issues.apache.org/jira/browse/SPARK-47210 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > > *What changes were proposed in this pull request?* > This PR adds automatic casting and collations resolution as per `PGSQL` > behaviour: > 1. Collations set on the metadata level are implicit > 2. Collations set using the `COLLATE` expression are explicit > 3. When there is a combination of expressions of multiple collations the > output will be: > - if there are explicit collations and all of them are equal then that > collation will be the output > - if there are multiple different explicit collations > `COLLATION_MISMATCH.EXPLICIT` will be thrown > - if there are no explicit collations and only a single type of non default > collation, that one will be used > - if there are no explicit collations and multiple non-default implicit ones > `COLLATION_MISMATCH.IMPLICIT` will be thrown > Another thing is that `INDETERMINATE_COLLATION` should only be thrown on > comparison operations, and we should be able to combine different implicit > collations for certain operations like concat and possible others in the > future. > This is why I had to add another predefined collation id named > `INDETERMINATE_COLLATION_ID` which means that the result is a combination of > conflicting non-default implicit collations. Right now it has an id of -1 so > it fails if it ever goes to the `CollatorFactory`. > *Why are the changes needed?* > We need to be able to compare columns and values with different collations > and set a way of explicitly changing the collation we want to use. 
[jira] [Updated] (SPARK-47210) Addition of implicit casting without indeterminate support
[ https://issues.apache.org/jira/browse/SPARK-47210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-47210: -- Description: *What changes were proposed in this pull request?* This PR adds automatic casting and collations resolution as per `PGSQL` behaviour: 1. Collations set on the metadata level are implicit 2. Collations set using the `COLLATE` expression are explicit 3. When there is a combination of expressions of multiple collations the output will be: - if there are explicit collations and all of them are equal then that collation will be the output - if there are multiple different explicit collations `COLLATION_MISMATCH.EXPLICIT` will be thrown - if there are no explicit collations and only a single type of non default collation, that one will be used - if there are no explicit collations and multiple non-default implicit ones `COLLATION_MISMATCH.IMPLICIT` will be thrown *Why are the changes needed?* We need to be able to compare columns and values with different collations and set a way of explicitly changing the collation we want to use. was: *What changes were proposed in this pull request?* This PR adds automatic casting and collations resolution as per `PGSQL` behaviour: 1. Collations set on the metadata level are implicit 2. Collations set using the `COLLATE` expression are explicit 3. 
When there is a combination of expressions of multiple collations the output will be: - if there are explicit collations and all of them are equal then that collation will be the output - if there are multiple different explicit collations `COLLATION_MISMATCH.EXPLICIT` will be thrown - if there are no explicit collations and only a single type of non default collation, that one will be used - if there are no explicit collations and multiple non-default implicit ones `COLLATION_MISMATCH.IMPLICIT` will be thrown Another thing is that `INDETERMINATE_COLLATION` should only be thrown on comparison operations, and we should be able to combine different implicit collations for certain operations like concat and possible others in the future. This is why I had to add another predefined collation id named `INDETERMINATE_COLLATION_ID` which means that the result is a combination of conflicting non-default implicit collations. Right now it has an id of -1 so it fails if it ever goes to the `CollatorFactory`. *Why are the changes needed?* We need to be able to compare columns and values with different collations and set a way of explicitly changing the collation we want to use. > Addition of implicit casting without indeterminate support > -- > > Key: SPARK-47210 > URL: https://issues.apache.org/jira/browse/SPARK-47210 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > > *What changes were proposed in this pull request?* > This PR adds automatic casting and collations resolution as per `PGSQL` > behaviour: > 1. Collations set on the metadata level are implicit > 2. Collations set using the `COLLATE` expression are explicit > 3. 
When there is a combination of expressions of multiple collations the > output will be: > - if there are explicit collations and all of them are equal then that > collation will be the output > - if there are multiple different explicit collations > `COLLATION_MISMATCH.EXPLICIT` will be thrown > - if there are no explicit collations and only a single type of non default > collation, that one will be used > - if there are no explicit collations and multiple non-default implicit ones > `COLLATION_MISMATCH.IMPLICIT` will be thrown > *Why are the changes needed?* > We need to be able to compare columns and values with different collations > and set a way of explicitly changing the collation we want to use.
[jira] [Created] (SPARK-47624) Collation Implicit Casting Support
Mihailo Milosevic created SPARK-47624: - Summary: Collation Implicit Casting Support Key: SPARK-47624 URL: https://issues.apache.org/jira/browse/SPARK-47624 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Mihailo Milosevic
[jira] [Updated] (SPARK-47626) Addition for Map Implicit Casting of Collated Strings
[ https://issues.apache.org/jira/browse/SPARK-47626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-47626: -- Description: The initial PR for collation implicit casting, [SPARK-47210], introduced support for casting of arrays and normal string types. > Addition for Map Implicit Casting of Collated Strings > - > > Key: SPARK-47626 > URL: https://issues.apache.org/jira/browse/SPARK-47626 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > > The initial PR for collation implicit casting, [SPARK-47210], introduced > support for casting of arrays and normal string types.
[jira] [Created] (SPARK-47622) Spark creates a lot of tiny blocks for a single driverLog file smaller than dfs.blocksize
Srinivasu Majeti created SPARK-47622: Summary: Spark creates a lot of tiny blocks for a single driverLog file smaller than dfs.blocksize Key: SPARK-47622 URL: https://issues.apache.org/jira/browse/SPARK-47622 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit Affects Versions: 3.3.2 Reporter: Srinivasu Majeti Upon reviewing the Spark code, we found that /user/spark/driverLogs files are synced to HDFS with the hsync option shown below. {code:java} hdfsStream.hsync(EnumSet.allOf(classOf[HdfsDataOutputStream.SyncFlag])) Ref: https://github.com/apache/spark/blob/a3c04ec1145662e4227d57cd953bffce96b8aad7/core/src/main/scala/org/apache/spark/util/logging/DriverLogger.scala{code} As a result, we see a lot of tiny blocks: every 5-second sync ends the current block and starts a new one. So a small HDFS file ends up with 8 blocks, as shown in the example below. {code:java} [r...@ccycloud-3.smajeti.root.comops.site subdir0]# hdfs fsck /user/spark/driverLogs/application_1710495774861_0002_driver.log Connecting to namenode via https://ccycloud-3.smajeti.root.comops.site:20102/fsck?ugi=hdfs=%2Fuser%2Fspark%2FdriverLogs%2Fapplication_1710495774861_0002_driver.log FSCK started by hdfs (auth:KERBEROS_SSL) from /10.140.136.139 for path /user/spark/driverLogs/application_1710495774861_0002_driver.log at Thu Mar 28 06:37:29 UTC 2024 Status: HEALTHY Number of data-nodes: 4 Number of racks: 1 Total dirs:0 Total symlinks:0 Replicated Blocks: Total size:157574 B Total files: 1 Total blocks (validated): 8 (avg. block size 19696 B) Minimally replicated blocks: 8 (100.0 %) {code} HdfsDataOutputStream.SyncFlag includes two flags, UPDATE_LENGTH and END_BLOCK. This has been the expected behavior for some time now: these flags help expose the latest size of the HDFS driver log file, and to achieve that, blocks are ended/closed on every 5-second sync. Every new sync creates a new block for the same HDFS driver log file.
This hsync behavior was introduced 5 years back when fixing SPARK-29105 (SHS may delete driver log file of in-progress application). But it leaves the NameNode managing a lot of metadata, which at times becomes an overhead in large clusters. {code:java} public static enum SyncFlag { UPDATE_LENGTH, END_BLOCK; private SyncFlag() { } } {code} I don't see any configurable option to avoid this, and avoiding this type of hsync may have side effects in Spark, as we saw with the SPARK-29105 bug. We only have two options, both of which need manual intervention: 1. Keep cleaning these driver logs after some time 2. Keep merging these small-block files into files with 128MB blocks Can we provide some configurable option to merge these blocks while closing the spark-shell or when closing the driver log file?
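The block growth described above is easy to estimate: if each 5-second hsync with END_BLOCK closes the current block, the block count tracks elapsed time rather than bytes written. A rough, illustrative back-of-envelope (the 5-second interval comes from the report; the function is hypothetical):

```python
import math

def estimated_driver_log_blocks(duration_s: float, sync_interval_s: float = 5.0) -> int:
    """Roughly one HDFS block per sync interval, regardless of bytes written."""
    return max(1, math.ceil(duration_s / sync_interval_s))
```

On this model, the 8-block, 157 KB file in the fsck output is consistent with a driver that ran for well under a minute, which is why tiny short-lived apps still leave many blocks for the NameNode to track.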
[jira] [Updated] (SPARK-47628) Fix Postgres bit array issue 'Cannot cast to boolean'
[ https://issues.apache.org/jira/browse/SPARK-47628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47628: --- Labels: pull-request-available (was: ) > Fix Postgres bit array issue 'Cannot cast to boolean' > - > > Key: SPARK-47628 > URL: https://issues.apache.org/jira/browse/SPARK-47628 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > > {code:java} > [info] Cause: org.postgresql.util.PSQLException: Cannot cast to boolean: > "10101" > [info] at > org.postgresql.jdbc.BooleanTypeUtil.cannotCoerceException(BooleanTypeUtil.java:99) > [info] at > org.postgresql.jdbc.BooleanTypeUtil.fromString(BooleanTypeUtil.java:67) > [info] at > org.postgresql.jdbc.ArrayDecoding$7.parseValue(ArrayDecoding.java:267) > [info] at > org.postgresql.jdbc.ArrayDecoding$AbstractObjectStringArrayDecoder.populateFromString(ArrayDecoding.java:128) > [info] at > org.postgresql.jdbc.ArrayDecoding.readStringArray(ArrayDecoding.java:763) > [info] at org.postgresql.jdbc.PgArray.buildArray(PgArray.java:320) > [info] at org.postgresql.jdbc.PgArray.getArrayImpl(PgArray.java:179) > [info] at org.postgresql.jdbc.PgArray.getArray(PgArray.java:116) > [info] at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$25(JdbcUtils.scala:548) > [info] at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:561) > [info] at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$24(JdbcUtils.scala:548) > [info] at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$24$adapted(JdbcUtils.scala:545) > [info] at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:365) > [info] at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:346) > [info] at 
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > [info] at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > {code}
[jira] [Resolved] (SPARK-47621) Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean`
[ https://issues.apache.org/jira/browse/SPARK-47621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-47621. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45745 [https://github.com/apache/spark/pull/45745] > Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean` > -- > > Key: SPARK-47621 > URL: https://issues.apache.org/jira/browse/SPARK-47621 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Assigned] (SPARK-47559) Codegen Support for variant `parse_json`
[ https://issues.apache.org/jira/browse/SPARK-47559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-47559: --- Assignee: BingKun Pan > Codegen Support for variant `parse_json` > > > Key: SPARK-47559 > URL: https://issues.apache.org/jira/browse/SPARK-47559 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available >
[jira] [Resolved] (SPARK-47559) Codegen Support for variant `parse_json`
[ https://issues.apache.org/jira/browse/SPARK-47559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47559. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45714 [https://github.com/apache/spark/pull/45714] > Codegen Support for variant `parse_json` > > > Key: SPARK-47559 > URL: https://issues.apache.org/jira/browse/SPARK-47559 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Updated] (SPARK-47629) Add `common/variant` to maven daily test module list
[ https://issues.apache.org/jira/browse/SPARK-47629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47629:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add `common/variant` to maven daily test module list
> ----------------------------------------------------
>
> Key: SPARK-47629
> URL: https://issues.apache.org/jira/browse/SPARK-47629
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Yang Jie
> Priority: Major
> Labels: pull-request-available
>
[jira] [Created] (SPARK-47629) Add `common/variant` to maven daily test module list
Yang Jie created SPARK-47629:
--------------------------------
    Summary: Add `common/variant` to maven daily test module list
    Key: SPARK-47629
    URL: https://issues.apache.org/jira/browse/SPARK-47629
    Project: Spark
    Issue Type: Improvement
    Components: Project Infra
    Affects Versions: 4.0.0
    Reporter: Yang Jie
[jira] [Updated] (SPARK-47475) Support `spark.kubernetes.jars.avoidDownloadSchemes` for K8s Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-47475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-47475:
----------------------------------
    Summary: Support `spark.kubernetes.jars.avoidDownloadSchemes` for K8s Cluster Mode  (was: Jars Download from Driver Caused Executor Scalability Issue)

> Support `spark.kubernetes.jars.avoidDownloadSchemes` for K8s Cluster Mode
> -------------------------------------------------------------------------
>
> Key: SPARK-47475
> URL: https://issues.apache.org/jira/browse/SPARK-47475
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, Kubernetes, Spark Core
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Jiale Tan
> Assignee: Jiale Tan
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> In K8s cluster deployment mode, all the jars (including the primary resource jar and jars from {{--jars}} or {{spark.jars}}) are downloaded to the driver's local disk and then served to executors through the file server running on the driver.
> When the jars are big and the application requests many executors, the massive concurrent jar downloads from the driver saturate its network. The executors' jar downloads then time out, and the executors are terminated. From the user's point of view, the application is trapped in a loop of massive executor loss and re-provisioning but never gets as many live executors as requested, which leads to job SLA breaches or sometimes job failure.
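The fan-out problem described in SPARK-47475 is easy to quantify: with N executors each fetching a J-byte jar from the driver at startup, the driver must push roughly N x J bytes through a single NIC. A back-of-the-envelope sketch (all numbers are illustrative assumptions, not values from the ticket):

```java
// Back-of-the-envelope sketch of the driver jar fan-out described above.
// The jar size, executor count, and NIC speed are illustrative assumptions.
public class JarFanOut {
    public static void main(String[] args) {
        long jarBytes = 500L * 1024 * 1024;   // assume a 500 MiB fat jar
        int executors = 1000;                 // assume 1000 executors requested
        double nicGbps = 10.0;                // assume a 10 Gbit/s driver NIC

        double totalBits = 8.0 * jarBytes * executors;
        double seconds = totalBits / (nicGbps * 1e9);
        // Even in the best case the driver needs ~419 s of pure transfer time,
        // far beyond typical fetch timeouts, so executors give up and are
        // re-provisioned in a loop, matching the symptom in the ticket.
        System.out.printf("best-case serve time: %.0f s%n", seconds);
    }
}
```

Letting executors pull such jars directly from a scalable store (the point of skipping the driver download for configured schemes) removes the driver NIC from that product entirely.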
[jira] [Assigned] (SPARK-47628) Fix Postgres bit array issue 'Cannot cast to boolean'
[ https://issues.apache.org/jira/browse/SPARK-47628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47628:
-------------------------------------
    Assignee: Kent Yao

> Fix Postgres bit array issue 'Cannot cast to boolean'
> -----------------------------------------------------
>
> Key: SPARK-47628
> URL: https://issues.apache.org/jira/browse/SPARK-47628
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Labels: pull-request-available
>
> {code:java}
> [info] Cause: org.postgresql.util.PSQLException: Cannot cast to boolean: "10101"
> [info] at org.postgresql.jdbc.BooleanTypeUtil.cannotCoerceException(BooleanTypeUtil.java:99)
> [info] at org.postgresql.jdbc.BooleanTypeUtil.fromString(BooleanTypeUtil.java:67)
> [info] at org.postgresql.jdbc.ArrayDecoding$7.parseValue(ArrayDecoding.java:267)
> [info] at org.postgresql.jdbc.ArrayDecoding$AbstractObjectStringArrayDecoder.populateFromString(ArrayDecoding.java:128)
> [info] at org.postgresql.jdbc.ArrayDecoding.readStringArray(ArrayDecoding.java:763)
> [info] at org.postgresql.jdbc.PgArray.buildArray(PgArray.java:320)
> [info] at org.postgresql.jdbc.PgArray.getArrayImpl(PgArray.java:179)
> [info] at org.postgresql.jdbc.PgArray.getArray(PgArray.java:116)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$25(JdbcUtils.scala:548)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:561)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$24(JdbcUtils.scala:548)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$24$adapted(JdbcUtils.scala:545)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:365)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:346)
> [info] at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> [info] at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> {code}
[jira] [Resolved] (SPARK-47628) Fix Postgres bit array issue 'Cannot cast to boolean'
[ https://issues.apache.org/jira/browse/SPARK-47628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47628.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45751
[https://github.com/apache/spark/pull/45751]

> Fix Postgres bit array issue 'Cannot cast to boolean'
> -----------------------------------------------------
>
> Key: SPARK-47628
> URL: https://issues.apache.org/jira/browse/SPARK-47628
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> {code:java}
> [info] Cause: org.postgresql.util.PSQLException: Cannot cast to boolean: "10101"
> [info] at org.postgresql.jdbc.BooleanTypeUtil.cannotCoerceException(BooleanTypeUtil.java:99)
> [info] at org.postgresql.jdbc.BooleanTypeUtil.fromString(BooleanTypeUtil.java:67)
> [info] at org.postgresql.jdbc.ArrayDecoding$7.parseValue(ArrayDecoding.java:267)
> [info] at org.postgresql.jdbc.ArrayDecoding$AbstractObjectStringArrayDecoder.populateFromString(ArrayDecoding.java:128)
> [info] at org.postgresql.jdbc.ArrayDecoding.readStringArray(ArrayDecoding.java:763)
> [info] at org.postgresql.jdbc.PgArray.buildArray(PgArray.java:320)
> [info] at org.postgresql.jdbc.PgArray.getArrayImpl(PgArray.java:179)
> [info] at org.postgresql.jdbc.PgArray.getArray(PgArray.java:116)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$25(JdbcUtils.scala:548)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:561)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$24(JdbcUtils.scala:548)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$24$adapted(JdbcUtils.scala:545)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:365)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:346)
> [info] at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> [info] at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> {code}
[jira] [Commented] (SPARK-47622) Spark creates lot of tiny blocks for a single driverLog file of size less than a dfs.blocksize
[ https://issues.apache.org/jira/browse/SPARK-47622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831831#comment-17831831 ]

Srinivasu Majeti commented on SPARK-47622:
------------------------------------------

CCing [~vanzin] to look at it and guide on the next proceedings. Thank you!

> Spark creates lot of tiny blocks for a single driverLog file of size less than a dfs.blocksize
> ----------------------------------------------------------------------------------------------
>
> Key: SPARK-47622
> URL: https://issues.apache.org/jira/browse/SPARK-47622
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell, Spark Submit
> Affects Versions: 3.3.2
> Reporter: Srinivasu Majeti
> Priority: Major
>
> Upon reviewing the Spark code, we found that /user/spark/driverLogs files are synced to HDFS with the hsync option, as shown below.
> {code:java}
> hdfsStream.hsync(EnumSet.allOf(classOf[HdfsDataOutputStream.SyncFlag]))
> {code}
> Ref: https://github.com/apache/spark/blob/a3c04ec1145662e4227d57cd953bffce96b8aad7/core/src/main/scala/org/apache/spark/util/logging/DriverLogger.scala
> As a result, a lot of tiny blocks get created: each 5-second sync ends the current block and starts a new one. So we see a small HDFS file with 8 blocks, as in the example below.
> {code:java}
> [r...@ccycloud-3.smajeti.root.comops.site subdir0]# hdfs fsck /user/spark/driverLogs/application_1710495774861_0002_driver.log
> Connecting to namenode via https://ccycloud-3.smajeti.root.comops.site:20102/fsck?ugi=hdfs=%2Fuser%2Fspark%2FdriverLogs%2Fapplication_1710495774861_0002_driver.log
> FSCK started by hdfs (auth:KERBEROS_SSL) from /10.140.136.139 for path /user/spark/driverLogs/application_1710495774861_0002_driver.log at Thu Mar 28 06:37:29 UTC 2024
> Status: HEALTHY
> Number of data-nodes: 4
> Number of racks: 1
> Total dirs: 0
> Total symlinks: 0
> Replicated Blocks:
> Total size: 157574 B
> Total files: 1
> Total blocks (validated): 8 (avg. block size 19696 B)
> Minimally replicated blocks: 8 (100.0 %)
> {code}
> HdfsDataOutputStream.SyncFlag includes two flags, UPDATE_LENGTH and END_BLOCK.
> This has been the expected behavior for some time now: these flags make the latest size of the HDFS driver log file visible, and to achieve that, blocks are ended/closed on every 5-second sync. Each new sync then creates a new block for the same HDFS driver log file. This hsync behavior was introduced by the fix for SPARK-29105 (SHS may delete driver log file of in-progress application) 5 years back.
> But this leaves the Namenode managing a lot of metadata and can become an overhead in large clusters.
> {code:java}
> public static enum SyncFlag {
>     UPDATE_LENGTH,
>     END_BLOCK;
>
>     private SyncFlag() {
>     }
> }
> {code}
> I don't see any configurable option to avoid this, and avoiding this type of hsync may have side effects in Spark, as we saw with the SPARK-29105 bug.
> We only have two options, both of which need manual intervention:
> 1. Keep cleaning these driver logs after some time
> 2. Keep merging these small block files into files with 128MB blocks
> Can we provide some customizable option to merge these blocks while closing the spark-shell or during closing of the driver log file?
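The namenode overhead is visible directly in the fsck output quoted above: a 157574-byte log file occupies 8 blocks (average 19696 bytes each), where a single default-sized 128 MB block would hold all of it. A quick check of that arithmetic:

```java
// Verifies the block accounting from the fsck report quoted above.
public class DriverLogBlocks {
    public static void main(String[] args) {
        long totalSize = 157_574L;  // "Total size: 157574 B"
        int blocks = 8;             // "Total blocks (validated): 8"
        // Integer division reproduces fsck's "avg. block size 19696 B".
        System.out.println(totalSize / blocks);

        long dfsBlockSize = 128L * 1024 * 1024;  // default dfs.blocksize
        // Ceiling division: how many full-size blocks the data actually needs.
        long needed = (totalSize + dfsBlockSize - 1) / dfsBlockSize;
        System.out.println(needed);  // a single block would suffice
    }
}
```

So each such driver log multiplies the namenode's per-file block metadata roughly eightfold relative to a normally written file, which is the overhead the comment is concerned about at cluster scale.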
[jira] [Assigned] (SPARK-47614) Rename `JavaModuleOptions` to `JVMRuntimeOptions`
[ https://issues.apache.org/jira/browse/SPARK-47614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao reassigned SPARK-47614:
--------------------------------
    Assignee: BingKun Pan

> Rename `JavaModuleOptions` to `JVMRuntimeOptions`
> -------------------------------------------------
>
> Key: SPARK-47614
> URL: https://issues.apache.org/jira/browse/SPARK-47614
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
>
[jira] [Resolved] (SPARK-47614) Rename `JavaModuleOptions` to `JVMRuntimeOptions`
[ https://issues.apache.org/jira/browse/SPARK-47614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao resolved SPARK-47614.
------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45735
[https://github.com/apache/spark/pull/45735]

> Rename `JavaModuleOptions` to `JVMRuntimeOptions`
> -------------------------------------------------
>
> Key: SPARK-47614
> URL: https://issues.apache.org/jira/browse/SPARK-47614
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Updated] (SPARK-47628) Fix Postgres bit array issue 'Cannot cast to boolean'
[ https://issues.apache.org/jira/browse/SPARK-47628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-47628:
-----------------------------
    Description:
{code:java}
[info] Cause: org.postgresql.util.PSQLException: Cannot cast to boolean: "10101"
[info] at org.postgresql.jdbc.BooleanTypeUtil.cannotCoerceException(BooleanTypeUtil.java:99)
[info] at org.postgresql.jdbc.BooleanTypeUtil.fromString(BooleanTypeUtil.java:67)
[info] at org.postgresql.jdbc.ArrayDecoding$7.parseValue(ArrayDecoding.java:267)
[info] at org.postgresql.jdbc.ArrayDecoding$AbstractObjectStringArrayDecoder.populateFromString(ArrayDecoding.java:128)
[info] at org.postgresql.jdbc.ArrayDecoding.readStringArray(ArrayDecoding.java:763)
[info] at org.postgresql.jdbc.PgArray.buildArray(PgArray.java:320)
[info] at org.postgresql.jdbc.PgArray.getArrayImpl(PgArray.java:179)
[info] at org.postgresql.jdbc.PgArray.getArray(PgArray.java:116)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$25(JdbcUtils.scala:548)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:561)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$24(JdbcUtils.scala:548)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$24$adapted(JdbcUtils.scala:545)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:365)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:346)
[info] at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
[info] at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
{code}

> Fix Postgres bit array issue 'Cannot cast to boolean'
> -----------------------------------------------------
>
> Key: SPARK-47628
> URL: https://issues.apache.org/jira/browse/SPARK-47628
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Priority: Major
>
> {code:java}
> [info] Cause: org.postgresql.util.PSQLException: Cannot cast to boolean: "10101"
> [info] at org.postgresql.jdbc.BooleanTypeUtil.cannotCoerceException(BooleanTypeUtil.java:99)
> [info] at org.postgresql.jdbc.BooleanTypeUtil.fromString(BooleanTypeUtil.java:67)
> [info] at org.postgresql.jdbc.ArrayDecoding$7.parseValue(ArrayDecoding.java:267)
> [info] at org.postgresql.jdbc.ArrayDecoding$AbstractObjectStringArrayDecoder.populateFromString(ArrayDecoding.java:128)
> [info] at org.postgresql.jdbc.ArrayDecoding.readStringArray(ArrayDecoding.java:763)
> [info] at org.postgresql.jdbc.PgArray.buildArray(PgArray.java:320)
> [info] at org.postgresql.jdbc.PgArray.getArrayImpl(PgArray.java:179)
> [info] at org.postgresql.jdbc.PgArray.getArray(PgArray.java:116)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$25(JdbcUtils.scala:548)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:561)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$24(JdbcUtils.scala:548)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$24$adapted(JdbcUtils.scala:545)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:365)
> [info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:346)
> [info] at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> [info] at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> {code}
[jira] [Created] (SPARK-47628) Fix Postgres bit array issue 'Cannot cast to boolean'
Kent Yao created SPARK-47628:
--------------------------------
    Summary: Fix Postgres bit array issue 'Cannot cast to boolean'
    Key: SPARK-47628
    URL: https://issues.apache.org/jira/browse/SPARK-47628
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Affects Versions: 4.0.0
    Reporter: Kent Yao
[jira] [Updated] (SPARK-47614) Update some outdated comments about JavaModuleOptions
[ https://issues.apache.org/jira/browse/SPARK-47614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BingKun Pan updated SPARK-47614:
--------------------------------
    Component/s: Documentation

> Update some outdated comments about JavaModuleOptions
> -----------------------------------------------------
>
> Key: SPARK-47614
> URL: https://issues.apache.org/jira/browse/SPARK-47614
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, Spark Core
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Updated] (SPARK-47629) Add `common/variant` and `connector/kinesis-asl` to maven daily test module list
[ https://issues.apache.org/jira/browse/SPARK-47629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Jie updated SPARK-47629:
-----------------------------
    Summary: Add `common/variant` and `connector/kinesis-asl` to maven daily test module list  (was: Add `common/variant` to maven daily test module list)

> Add `common/variant` and `connector/kinesis-asl` to maven daily test module list
> --------------------------------------------------------------------------------
>
> Key: SPARK-47629
> URL: https://issues.apache.org/jira/browse/SPARK-47629
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Yang Jie
> Priority: Major
> Labels: pull-request-available
>
[jira] [Updated] (SPARK-47614) Update some outdated comments about JavaModuleOptions
[ https://issues.apache.org/jira/browse/SPARK-47614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BingKun Pan updated SPARK-47614:
--------------------------------
    Summary: Update some outdated comments about JavaModuleOptions  (was: Rename `JavaModuleOptions` to `JVMRuntimeOptions`)

> Update some outdated comments about JavaModuleOptions
> -----------------------------------------------------
>
> Key: SPARK-47614
> URL: https://issues.apache.org/jira/browse/SPARK-47614
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Updated] (SPARK-47634) Legacy support for map normalization
[ https://issues.apache.org/jira/browse/SPARK-47634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47634:
-----------------------------------
    Labels: pull-request-available  (was: )

> Legacy support for map normalization
> ------------------------------------
>
> Key: SPARK-47634
> URL: https://issues.apache.org/jira/browse/SPARK-47634
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Stevo Mitric
> Priority: Major
> Labels: pull-request-available
>
> Add legacy support for creating a map without normalizing keys before inserting in `ArrayBasedMapBuilder`.
>
> Key normalization change can be found in this PR: https://issues.apache.org/jira/browse/SPARK-47563
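For context on why key normalization matters for a map builder (a general JVM illustration of the class of problem, not the exact `ArrayBasedMapBuilder` logic from the ticket): floating-point `0.0` and `-0.0` compare equal as primitives but are distinguished by boxed equality and by `Double.compare`, so a builder that deduplicates keys via boxed values must canonicalize them before insertion.

```java
// Illustrates the 0.0 vs -0.0 key ambiguity that makes normalization of
// float/double map keys necessary (general JVM behavior, not Spark code).
import java.util.HashMap;
import java.util.Map;

public class ZeroKeys {
    public static void main(String[] args) {
        System.out.println(0.0 == -0.0);                 // primitive comparison: equal
        System.out.println(Double.valueOf(0.0)
                .equals(Double.valueOf(-0.0)));          // boxed equality compares bits: not equal
        System.out.println(Double.compare(0.0, -0.0));   // positive: 0.0 orders after -0.0

        // A HashMap keyed on boxed doubles therefore keeps both "zero" entries,
        // even though SQL semantics would usually treat them as one key.
        Map<Double, String> m = new HashMap<>();
        m.put(0.0, "positive zero");
        m.put(-0.0, "negative zero");
        System.out.println(m.size());
    }
}
```

Normalizing (e.g. rewriting `-0.0` to `0.0` before insertion) collapses the two entries; the legacy flag proposed in this ticket would restore the old non-normalizing behavior for users who depend on it.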
[jira] [Resolved] (SPARK-47630) Upgrade `zstd-jni` to 1.5.6-1
[ https://issues.apache.org/jira/browse/SPARK-47630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47630.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45756
[https://github.com/apache/spark/pull/45756]

> Upgrade `zstd-jni` to 1.5.6-1
> -----------------------------
>
> Key: SPARK-47630
> URL: https://issues.apache.org/jira/browse/SPARK-47630
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Created] (SPARK-47632) Ban `com.amazonaws:aws-java-sdk-bundle` dependency
Dongjoon Hyun created SPARK-47632:
-------------------------------------
    Summary: Ban `com.amazonaws:aws-java-sdk-bundle` dependency
    Key: SPARK-47632
    URL: https://issues.apache.org/jira/browse/SPARK-47632
    Project: Spark
    Issue Type: Sub-task
    Components: Build
    Affects Versions: 4.0.0
    Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-47632) Ban `com.amazonaws:aws-java-sdk-bundle` dependency
[ https://issues.apache.org/jira/browse/SPARK-47632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47632:
-----------------------------------
    Labels: pull-request-available  (was: )

> Ban `com.amazonaws:aws-java-sdk-bundle` dependency
> --------------------------------------------------
>
> Key: SPARK-47632
> URL: https://issues.apache.org/jira/browse/SPARK-47632
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
[jira] [Resolved] (SPARK-47632) Ban `com.amazonaws:aws-java-sdk-bundle` dependency
[ https://issues.apache.org/jira/browse/SPARK-47632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47632.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/45759

> Ban `com.amazonaws:aws-java-sdk-bundle` dependency
> --------------------------------------------------
>
> Key: SPARK-47632
> URL: https://issues.apache.org/jira/browse/SPARK-47632
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Assigned] (SPARK-47632) Ban `com.amazonaws:aws-java-sdk-bundle` dependency
[ https://issues.apache.org/jira/browse/SPARK-47632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47632:
-------------------------------------
    Assignee: Dongjoon Hyun

> Ban `com.amazonaws:aws-java-sdk-bundle` dependency
> --------------------------------------------------
>
> Key: SPARK-47632
> URL: https://issues.apache.org/jira/browse/SPARK-47632
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
[jira] [Assigned] (SPARK-47492) Relax definition of whitespace in lexer
[ https://issues.apache.org/jira/browse/SPARK-47492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang reassigned SPARK-47492:
--------------------------------------
    Assignee: Serge Rielau

> Relax definition of whitespace in lexer
> ---------------------------------------
>
> Key: SPARK-47492
> URL: https://issues.apache.org/jira/browse/SPARK-47492
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Serge Rielau
> Assignee: Serge Rielau
> Priority: Major
> Labels: pull-request-available
>
> There have been multiple incidents where queries "copied" in from other sources resulted in "weird" syntax errors, which ultimately boiled down to whitespace characters that the lexer does not recognize as such.
[jira] [Resolved] (SPARK-47492) Relax definition of whitespace in lexer
[ https://issues.apache.org/jira/browse/SPARK-47492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang resolved SPARK-47492.
------------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45620
[https://github.com/apache/spark/pull/45620]

> Relax definition of whitespace in lexer
> ---------------------------------------
>
> Key: SPARK-47492
> URL: https://issues.apache.org/jira/browse/SPARK-47492
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Serge Rielau
> Assignee: Serge Rielau
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> There have been multiple incidents where queries "copied" in from other sources resulted in "weird" syntax errors, which ultimately boiled down to whitespace characters that the lexer does not recognize as such.
[jira] [Created] (SPARK-47635) Use Java 21 instead of 21-jre in K8s Dockerfile
Dongjoon Hyun created SPARK-47635:
-------------------------------------
    Summary: Use Java 21 instead of 21-jre in K8s Dockerfile
    Key: SPARK-47635
    URL: https://issues.apache.org/jira/browse/SPARK-47635
    Project: Spark
    Issue Type: Sub-task
    Components: Kubernetes
    Affects Versions: 4.0.0
    Reporter: Dongjoon Hyun

{code}
$ docker run -it --rm azul/zulu-openjdk:21-jre jmap
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "jmap": executable file not found in $PATH: unknown.
{code}
[jira] [Created] (SPARK-47631) Remove unused `SQLConf.parquetOutputCommitterClass` method
Dongjoon Hyun created SPARK-47631:
-------------------------------------
    Summary: Remove unused `SQLConf.parquetOutputCommitterClass` method
    Key: SPARK-47631
    URL: https://issues.apache.org/jira/browse/SPARK-47631
    Project: Spark
    Issue Type: Task
    Components: SQL
    Affects Versions: 4.0.0
    Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-47631) Remove unused `SQLConf.parquetOutputCommitterClass` method
[ https://issues.apache.org/jira/browse/SPARK-47631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47631:
-----------------------------------
    Labels: pull-request-available  (was: )

> Remove unused `SQLConf.parquetOutputCommitterClass` method
> ----------------------------------------------------------
>
> Key: SPARK-47631
> URL: https://issues.apache.org/jira/browse/SPARK-47631
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Trivial
> Labels: pull-request-available
>
[jira] [Updated] (SPARK-47635) Use Java 21 instead of 21-jre in K8s Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-47635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-47635:
----------------------------------
    Affects Version/s: 3.5.1
                       3.5.0

> Use Java 21 instead of 21-jre in K8s Dockerfile
> -----------------------------------------------
>
> Key: SPARK-47635
> URL: https://issues.apache.org/jira/browse/SPARK-47635
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes
> Affects Versions: 3.5.0, 4.0.0, 3.5.1
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
> {code}
> $ docker run -it --rm azul/zulu-openjdk:21-jre jmap
> docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "jmap": executable file not found in $PATH: unknown.
> {code}
[jira] [Created] (SPARK-47634) Legacy support for map normalization
Stevo Mitric created SPARK-47634:
------------------------------------
    Summary: Legacy support for map normalization
    Key: SPARK-47634
    URL: https://issues.apache.org/jira/browse/SPARK-47634
    Project: Spark
    Issue Type: Task
    Components: SQL
    Affects Versions: 4.0.0
    Reporter: Stevo Mitric

Add legacy support for creating a map without normalizing keys before inserting in `ArrayBasedMapBuilder`.

Key normalization change can be found in this PR: https://issues.apache.org/jira/browse/SPARK-47563
[jira] [Assigned] (SPARK-47635) Use Java 21 instead of 21-jre in K8s Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-47635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47635:
-------------------------------------
    Assignee: Dongjoon Hyun

> Use Java 21 instead of 21-jre in K8s Dockerfile
> -----------------------------------------------
>
> Key: SPARK-47635
> URL: https://issues.apache.org/jira/browse/SPARK-47635
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes
> Affects Versions: 3.5.0, 4.0.0, 3.5.1
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
> {code}
> $ docker run -it --rm azul/zulu-openjdk:21-jre jmap
> docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "jmap": executable file not found in $PATH: unknown.
> {code}
[jira] [Created] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition
Bruce Robbins created SPARK-47633: - Summary: Cache miss for queries using JOIN LATERAL with join condition Key: SPARK-47633 URL: https://issues.apache.org/jira/browse/SPARK-47633 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Bruce Robbins For example: {noformat} CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2); CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2); create or replace temp view v1 as select * from t1 join lateral ( select c1 as a, c2 as b from t2) on c1 = a; cache table v1; explain select * from v1; == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false :- LocalTableScan [c1#180, c2#181] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=113] +- LocalTableScan [a#173, b#174] {noformat} Note that there is no {{InMemoryRelation}}. However, if you move the join condition into the subquery, the cached plan is used: {noformat} CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2); CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2); create or replace temp view v2 as select * from t1 join lateral ( select c1 as a, c2 as b from t2 where t1.c1 = t2.c1); cache table v2; explain select * from v2; == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Scan In-memory table v2 [c1#176, c2#177, a#178, b#179] +- InMemoryRelation [c1#176, c2#177, a#178, b#179], StorageLevel(disk, memory, deserialized, 1 replicas) +- AdaptiveSparkPlan isFinalPlan=true +- == Final Plan == *(1) Project [c1#26, c2#27, a#19, b#20] +- *(1) BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, false :- BroadcastQueryStage 0 : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=37] : +- LocalTableScan [c1#26, c2#27] +- *(1) LocalTableScan [a#19, b#20, c1#30] +- == Initial Plan == Project [c1#26, c2#27, a#19, b#20] +- 
BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, false :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=37] : +- LocalTableScan [c1#26, c2#27] +- LocalTableScan [a#19, b#20, c1#30] {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
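The miss can be pictured with a small sketch (not Spark's CacheManager; the plan tuples are made-up stand-ins for analyzed plans). Spark looks up cached data by comparing canonicalized plans, so a query reuses the cache only when its plan canonicalizes to the exact tree that was cached; if the lateral join's ON condition ends up in a different place than in the cached view's plan, the trees differ and the lookup fails:

```python
# Cache keyed by the (canonicalized) plan tree, as in Spark's cache manager.
cache = {}

def cache_plan(plan, data):
    cache[plan] = data

def lookup(plan):
    return cache.get(plan)

# View v1 as cached: the join condition kept on the lateral join itself.
v1_plan = ("Join", ("Scan", "t1"),
           ("LateralSubquery", ("Scan", "t2")), "c1 = a")

# The same query as rendered at lookup time, with the predicate pushed
# into the subquery: semantically equal, structurally different.
lookup_plan = ("Join", ("Scan", "t1"),
               ("LateralSubquery", ("Filter", "c1 = a", ("Scan", "t2"))),
               None)

cache_plan(v1_plan, "InMemoryRelation")
hit = lookup(v1_plan)        # "InMemoryRelation"
miss = lookup(lookup_plan)   # None: structural mismatch -> cache miss
```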
[jira] [Updated] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition
[ https://issues.apache.org/jira/browse/SPARK-47633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-47633: -- Affects Version/s: 3.5.1 > Cache miss for queries using JOIN LATERAL with join condition > - > > Key: SPARK-47633 > URL: https://issues.apache.org/jira/browse/SPARK-47633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: Bruce Robbins >Priority: Major > > For example: > {noformat} > CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2); > CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2); > create or replace temp view v1 as > select * > from t1 > join lateral ( > select c1 as a, c2 as b > from t2) > on c1 = a; > cache table v1; > explain select * from v1; > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false >:- LocalTableScan [c1#180, c2#181] >+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > false] as bigint)),false), [plan_id=113] > +- LocalTableScan [a#173, b#174] > {noformat} > Note that there is no {{InMemoryRelation}}. 
> However, if you move the join condition into the subquery, the cached plan is > used: > {noformat} > CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2); > CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2); > create or replace temp view v2 as > select * > from t1 > join lateral ( > select c1 as a, c2 as b > from t2 > where t1.c1 = t2.c1); > cache table v2; > explain select * from v2; > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- Scan In-memory table v2 [c1#176, c2#177, a#178, b#179] > +- InMemoryRelation [c1#176, c2#177, a#178, b#179], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=true >+- == Final Plan == > *(1) Project [c1#26, c2#27, a#19, b#20] > +- *(1) BroadcastHashJoin [c1#26], [c1#30], Inner, > BuildLeft, false > :- BroadcastQueryStage 0 > : +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, false] as > bigint)),false), [plan_id=37] > : +- LocalTableScan [c1#26, c2#27] > +- *(1) LocalTableScan [a#19, b#20, c1#30] >+- == Initial Plan == > Project [c1#26, c2#27, a#19, b#20] > +- BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, > false > :- BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, false] as > bigint)),false), [plan_id=37] > : +- LocalTableScan [c1#26, c2#27] > +- LocalTableScan [a#19, b#20, c1#30] > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition
[ https://issues.apache.org/jira/browse/SPARK-47633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-47633: -- Affects Version/s: 3.4.2 > Cache miss for queries using JOIN LATERAL with join condition > - > > Key: SPARK-47633 > URL: https://issues.apache.org/jira/browse/SPARK-47633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2, 4.0.0, 3.5.1 >Reporter: Bruce Robbins >Priority: Major > > For example: > {noformat} > CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2); > CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2); > create or replace temp view v1 as > select * > from t1 > join lateral ( > select c1 as a, c2 as b > from t2) > on c1 = a; > cache table v1; > explain select * from v1; > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false >:- LocalTableScan [c1#180, c2#181] >+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > false] as bigint)),false), [plan_id=113] > +- LocalTableScan [a#173, b#174] > {noformat} > Note that there is no {{InMemoryRelation}}. 
> However, if you move the join condition into the subquery, the cached plan is > used: > {noformat} > CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2); > CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2); > create or replace temp view v2 as > select * > from t1 > join lateral ( > select c1 as a, c2 as b > from t2 > where t1.c1 = t2.c1); > cache table v2; > explain select * from v2; > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- Scan In-memory table v2 [c1#176, c2#177, a#178, b#179] > +- InMemoryRelation [c1#176, c2#177, a#178, b#179], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=true >+- == Final Plan == > *(1) Project [c1#26, c2#27, a#19, b#20] > +- *(1) BroadcastHashJoin [c1#26], [c1#30], Inner, > BuildLeft, false > :- BroadcastQueryStage 0 > : +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, false] as > bigint)),false), [plan_id=37] > : +- LocalTableScan [c1#26, c2#27] > +- *(1) LocalTableScan [a#19, b#20, c1#30] >+- == Initial Plan == > Project [c1#26, c2#27, a#19, b#20] > +- BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, > false > :- BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, false] as > bigint)),false), [plan_id=37] > : +- LocalTableScan [c1#26, c2#27] > +- LocalTableScan [a#19, b#20, c1#30] > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47636) Use Java 17 instead of 17-jre image in K8s Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-47636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47636: --- Labels: pull-request-available (was: ) > Use Java 17 instead of 17-jre image in K8s Dockerfile > - > > Key: SPARK-47636 > URL: https://issues.apache.org/jira/browse/SPARK-47636 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.5.0, 3.5.1 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47635) Use Java 21 instead of 21-jre in K8s Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-47635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47635: -- Affects Version/s: (was: 3.5.0) (was: 3.5.1) > Use Java 21 instead of 21-jre in K8s Dockerfile > --- > > Key: SPARK-47635 > URL: https://issues.apache.org/jira/browse/SPARK-47635 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > {code} > $ docker run -it --rm azul/zulu-openjdk:21-jre jmap > docker: Error response from daemon: failed to create task for container: > failed to create shim task: OCI runtime create failed: runc create failed: > unable to start container process: exec: "jmap": executable file not found in > $PATH: unknown. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47636) Use Java 17 instead of 17-jre image in K8s Dockerfile
Dongjoon Hyun created SPARK-47636: - Summary: Use Java 17 instead of 17-jre image in K8s Dockerfile Key: SPARK-47636 URL: https://issues.apache.org/jira/browse/SPARK-47636 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.5.1, 3.5.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
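The failure quoted in the sibling ticket stems from JRE-only base images omitting JDK tooling such as `jmap`, `jstack`, and `jcmd`. A hedged sketch of the kind of change these tickets make (tag names taken from the reports; the real Spark Dockerfiles differ):

```dockerfile
# JRE-only images ship without JDK diagnostic tools.
# FROM azul/zulu-openjdk:21-jre   # jmap: executable file not found in $PATH
FROM azul/zulu-openjdk:21         # full JDK tag includes the tooling
```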
[jira] [Updated] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition
[ https://issues.apache.org/jira/browse/SPARK-47633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47633: --- Labels: pull-request-available (was: ) > Cache miss for queries using JOIN LATERAL with join condition > - > > Key: SPARK-47633 > URL: https://issues.apache.org/jira/browse/SPARK-47633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2, 4.0.0, 3.5.1 >Reporter: Bruce Robbins >Priority: Major > Labels: pull-request-available > > For example: > {noformat} > CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2); > CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2); > create or replace temp view v1 as > select * > from t1 > join lateral ( > select c1 as a, c2 as b > from t2) > on c1 = a; > cache table v1; > explain select * from v1; > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false >:- LocalTableScan [c1#180, c2#181] >+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > false] as bigint)),false), [plan_id=113] > +- LocalTableScan [a#173, b#174] > {noformat} > Note that there is no {{InMemoryRelation}}. 
> However, if you move the join condition into the subquery, the cached plan is > used: > {noformat} > CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2); > CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2); > create or replace temp view v2 as > select * > from t1 > join lateral ( > select c1 as a, c2 as b > from t2 > where t1.c1 = t2.c1); > cache table v2; > explain select * from v2; > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- Scan In-memory table v2 [c1#176, c2#177, a#178, b#179] > +- InMemoryRelation [c1#176, c2#177, a#178, b#179], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=true >+- == Final Plan == > *(1) Project [c1#26, c2#27, a#19, b#20] > +- *(1) BroadcastHashJoin [c1#26], [c1#30], Inner, > BuildLeft, false > :- BroadcastQueryStage 0 > : +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, false] as > bigint)),false), [plan_id=37] > : +- LocalTableScan [c1#26, c2#27] > +- *(1) LocalTableScan [a#19, b#20, c1#30] >+- == Initial Plan == > Project [c1#26, c2#27, a#19, b#20] > +- BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, > false > :- BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, false] as > bigint)),false), [plan_id=37] > : +- LocalTableScan [c1#26, c2#27] > +- LocalTableScan [a#19, b#20, c1#30] > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47525) Support subquery correlation joining on map attributes
[ https://issues.apache.org/jira/browse/SPARK-47525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47525. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45673 [https://github.com/apache/spark/pull/45673] > Support subquery correlation joining on map attributes > -- > > Key: SPARK-47525 > URL: https://issues.apache.org/jira/browse/SPARK-47525 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently, when a subquery is correlated on a condition like `outer_map[1] = > inner_map[1]`, DecorrelateInnerQuery generates a join on the map itself, > which is unsupported, so the query cannot run - for example: > > {code:java} > scala> Seq(Map(0 -> 0)).toDF.createOrReplaceTempView("v") > scala> sql("select > v1.value[0] from v v1 where v1.value[0] > (select avg(v2.value[0]) from v v2 > where v1.value[1] = v2.value[1])").explain > org.apache.spark.sql.AnalysisException: > [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE] > Unsupported subquery expression: Correlated column reference 'v1.value' > cannot be map type. SQLSTATE: 0A000; line 1 pos 49 > at > org.apache.spark.sql.errors.QueryCompilationErrors$.unsupportedCorrelatedReferenceDataTypeError(QueryCompilationErrors.scala:2463) > ... 
{code} > However, if we rewrite the query to pull out the map access `outer_map[1]` > into the outer plan, it succeeds: > > {code:java} > scala> sql("""with tmp as ( > select value[0] as value0, value[1] as value1 from v > ) > select v1.value0 from tmp v1 where v1.value0 > (select avg(v2.value0) from > tmp v2 where v1.value1 = v2.value1)""").explain{code} > Another point that can be improved is that, even if the data type supports > join, we still don’t need to join on the full attribute, and we can get a > better plan by doing the same rewrite to pull out the extract expression. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
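The rewrite described in the ticket can be mirrored in a pure-Python sketch (not Spark's DecorrelateInnerQuery; toy data standing in for the view `v`): pre-project the map elements into scalar columns, then correlate on the scalars instead of on the map itself:

```python
# Toy rows playing the role of view v, each with a map-typed "value" column.
rows = [{"value": {0: 10.0, 1: 1}},
        {"value": {0: 20.0, 1: 1}},
        {"value": {0: 30.0, 1: 2}}]

# Step 1: the "pull out" projection -- extract value[0] and value[1]
# into scalar columns (the `tmp` CTE in the ticket's rewrite).
tmp = [{"value0": r["value"][0], "value1": r["value"][1]} for r in rows]

# Step 2: the correlated scalar subquery now joins on the scalar value1,
# a supported correlation, rather than on the full map attribute.
def avg_value0(v1):
    xs = [r["value0"] for r in tmp if r["value1"] == v1]
    return sum(xs) / len(xs)

# select v1.value0 from tmp v1
# where v1.value0 > (select avg(v2.value0) from tmp v2
#                    where v1.value1 = v2.value1)
result = [r["value0"] for r in tmp if r["value0"] > avg_value0(r["value1"])]
```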
[jira] [Assigned] (SPARK-47638) Skip column name validation in PS
[ https://issues.apache.org/jira/browse/SPARK-47638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-47638: - Assignee: Ruifeng Zheng > Skip column name validation in PS > - > > Key: SPARK-47638 > URL: https://issues.apache.org/jira/browse/SPARK-47638 > Project: Spark > Issue Type: Improvement > Components: Connect, PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47638) Skip column name validation in PS
[ https://issues.apache.org/jira/browse/SPARK-47638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-47638. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45752 [https://github.com/apache/spark/pull/45752] > Skip column name validation in PS > - > > Key: SPARK-47638 > URL: https://issues.apache.org/jira/browse/SPARK-47638 > Project: Spark > Issue Type: Improvement > Components: Connect, PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47638) Skip column name validation in PS
[ https://issues.apache.org/jira/browse/SPARK-47638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47638: --- Labels: pull-request-available (was: ) > Skip column name validation in PS > - > > Key: SPARK-47638 > URL: https://issues.apache.org/jira/browse/SPARK-47638 > Project: Spark > Issue Type: Improvement > Components: Connect, PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47639) Support codegen for json_tuple
Xianming Lei created SPARK-47639: Summary: Support codegen for json_tuple Key: SPARK-47639 URL: https://issues.apache.org/jira/browse/SPARK-47639 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.1 Reporter: Xianming Lei Sometimes using json_tuple may cause a performance regression because it does not support whole-stage codegen. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47640) Support codegen for json_tuple
Xianming Lei created SPARK-47640: Summary: Support codegen for json_tuple Key: SPARK-47640 URL: https://issues.apache.org/jira/browse/SPARK-47640 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.1 Reporter: Xianming Lei Sometimes using json_tuple may cause a performance regression because it does not support whole-stage codegen. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47640) Support codegen for json_tuple
[ https://issues.apache.org/jira/browse/SPARK-47640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianming Lei resolved SPARK-47640. -- Resolution: Duplicate > Support codegen for json_tuple > -- > > Key: SPARK-47640 > URL: https://issues.apache.org/jira/browse/SPARK-47640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Xianming Lei >Priority: Major > > Sometimes using json_tuple may cause a performance regression because it does > not support whole-stage codegen. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47639) Support codegen for json_tuple
[ https://issues.apache.org/jira/browse/SPARK-47639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47639: --- Labels: pull-request-available (was: ) > Support codegen for json_tuple > -- > > Key: SPARK-47639 > URL: https://issues.apache.org/jira/browse/SPARK-47639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Xianming Lei >Priority: Major > Labels: pull-request-available > > Sometimes using json_tuple may cause a performance regression because it does > not support whole-stage codegen. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47637) Use errorCapturingIdentifier rule in more places to improve error messages
Serge Rielau created SPARK-47637: Summary: Use errorCapturingIdentifier rule in more places to improve error messages Key: SPARK-47637 URL: https://issues.apache.org/jira/browse/SPARK-47637 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Serge Rielau errorCapturingIdentifier parses identifiers containing '-' so that INVALID_IDENTIFIER is raised instead of SYNTAX_ERROR for non-delimited identifiers containing a hyphen. It is meant to be used wherever the context is not that of an expression. This Jira replaces a few missed identifiers with that rule. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
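The rule's intent can be sketched in a few lines of Python (a toy check, not Spark's ANTLR grammar; the message text is illustrative): accept the hyphenated token during parsing so a targeted INVALID_IDENTIFIER error can be raised, rather than letting the stray '-' surface as a generic syntax error:

```python
import re

# A non-delimited identifier with one or more embedded hyphens, e.g.
# "my-table". A real grammar rule would match this at identifier
# positions; here a regex stands in for it.
HYPHENATED_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(?:-[A-Za-z0-9_]+)+$")

def check_identifier(name):
    """Return a specific error for hyphenated identifiers, else 'ok'."""
    if HYPHENATED_IDENT.match(name):
        # The targeted error: tell the user to quote the identifier,
        # instead of a bare SYNTAX_ERROR at the '-' token.
        return f"INVALID_IDENTIFIER: '{name}' is invalid; quote it as `{name}`"
    return "ok"
```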
[jira] [Resolved] (SPARK-47635) Use Java 21 instead of 21-jre in K8s Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-47635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47635. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45761 [https://github.com/apache/spark/pull/45761] > Use Java 21 instead of 21-jre in K8s Dockerfile > --- > > Key: SPARK-47635 > URL: https://issues.apache.org/jira/browse/SPARK-47635 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {code} > $ docker run -it --rm azul/zulu-openjdk:21-jre jmap > docker: Error response from daemon: failed to create task for container: > failed to create shim task: OCI runtime create failed: runc create failed: > unable to start container process: exec: "jmap": executable file not found in > $PATH: unknown. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47631) Remove unused `SQLConf.parquetOutputCommitterClass` method
[ https://issues.apache.org/jira/browse/SPARK-47631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47631. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45757 [https://github.com/apache/spark/pull/45757] > Remove unused `SQLConf.parquetOutputCommitterClass` method > -- > > Key: SPARK-47631 > URL: https://issues.apache.org/jira/browse/SPARK-47631 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Trivial > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47525) Support subquery correlation joining on map attributes
[ https://issues.apache.org/jira/browse/SPARK-47525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-47525: --- Assignee: Jack Chen > Support subquery correlation joining on map attributes > -- > > Key: SPARK-47525 > URL: https://issues.apache.org/jira/browse/SPARK-47525 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Labels: pull-request-available > > Currently, when a subquery is correlated on a condition like `outer_map[1] = > inner_map[1]`, DecorrelateInnerQuery generates a join on the map itself, > which is unsupported, so the query cannot run - for example: > > {code:java} > scala> Seq(Map(0 -> 0)).toDF.createOrReplaceTempView("v") > scala> sql("select > v1.value[0] from v v1 where v1.value[0] > (select avg(v2.value[0]) from v v2 > where v1.value[1] = v2.value[1])").explain > org.apache.spark.sql.AnalysisException: > [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE] > Unsupported subquery expression: Correlated column reference 'v1.value' > cannot be map type. SQLSTATE: 0A000; line 1 pos 49 > at > org.apache.spark.sql.errors.QueryCompilationErrors$.unsupportedCorrelatedReferenceDataTypeError(QueryCompilationErrors.scala:2463) > ... {code} > However, if we rewrite the query to pull out the map access `outer_map[1]` > into the outer plan, it succeeds: > > {code:java} > scala> sql("""with tmp as ( > select value[0] as value0, value[1] as value1 from v > ) > select v1.value0 from tmp v1 where v1.value0 > (select avg(v2.value0) from > tmp v2 where v1.value1 = v2.value1)""").explain{code} > Another point that can be improved is that, even if the data type supports > join, we still don’t need to join on the full attribute, and we can get a > better plan by doing the same rewrite to pull out the extract expression. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47637) Use errorCapturingIdentifier rule in more places to improve error messages
[ https://issues.apache.org/jira/browse/SPARK-47637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47637: --- Labels: pull-request-available (was: ) > Use errorCapturingIdentifier rule in more places to improve error messages > -- > > Key: SPARK-47637 > URL: https://issues.apache.org/jira/browse/SPARK-47637 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Priority: Major > Labels: pull-request-available > > errorCapturingIdentifier parses identifiers containing '-' so that > INVALID_IDENTIFIER > is raised instead of SYNTAX_ERROR for non-delimited identifiers containing a hyphen. > It is meant to be used wherever the context is not that of an expression. > This Jira replaces a few missed identifiers with that rule. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47623) Use `QuietTest` in parity tests
[ https://issues.apache.org/jira/browse/SPARK-47623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47623. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45747 [https://github.com/apache/spark/pull/45747] > Use `QuietTest` in parity tests > --- > > Key: SPARK-47623 > URL: https://issues.apache.org/jira/browse/SPARK-47623 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47511) Canonicalize With expressions by re-assigning IDs
[ https://issues.apache.org/jira/browse/SPARK-47511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47511. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45649 [https://github.com/apache/spark/pull/45649] > Canonicalize With expressions by re-assigning IDs > - > > Key: SPARK-47511 > URL: https://issues.apache.org/jira/browse/SPARK-47511 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kelvin Jiang >Assignee: Kelvin Jiang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > The current canonicalization of `With` expressions takes into account the ID > of the common expressions, which comes from a global monotonically increasing > ID. This means that queries with `With` expressions (e.g. `NULLIF` > expressions) will have inconsistent canonicalizations. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
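The fix's idea can be sketched in Python (not Spark's implementation; the tuple encoding of a `With` expression is made up): re-assign common-expression IDs to position-local values during canonicalization, so two structurally identical `With` expressions compare equal even though the global ID counter handed them different IDs:

```python
def canonicalize(with_expr):
    """Re-number common-expression IDs by position within the With node.

    with_expr is encoded as ("With", [(cse_id, defn), ...], body), where
    the body refers to definitions by their integer cse_id. Globally
    allocated IDs are replaced with local 0, 1, 2, ... so that
    canonical forms do not depend on the global counter.
    """
    _, defs, body = with_expr
    mapping = {old: i for i, (old, _) in enumerate(defs)}
    new_defs = tuple((mapping[old], d) for old, d in defs)
    new_body = tuple(mapping.get(t, t) if isinstance(t, int) else t
                     for t in body)
    return ("With", new_defs, new_body)

# The same NULLIF-style expression built twice: the global counter gave the
# second instance a different ID, so the raw trees differ...
a = ("With", [(7, "x + 1")], ("if", 7, "null", 7))
b = ("With", [(42, "x + 1")], ("if", 42, "null", 42))
# ...but after ID re-assignment they canonicalize identically.
```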
[jira] [Resolved] (SPARK-47636) Use Java 17 instead of 17-jre image in K8s Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-47636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47636. --- Fix Version/s: 3.5.2 Resolution: Fixed Issue resolved by pull request 45762 [https://github.com/apache/spark/pull/45762] > Use Java 17 instead of 17-jre image in K8s Dockerfile > - > > Key: SPARK-47636 > URL: https://issues.apache.org/jira/browse/SPARK-47636 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.5.0, 3.5.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 3.5.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47631) Remove unused `SQLConf.parquetOutputCommitterClass` method
[ https://issues.apache.org/jira/browse/SPARK-47631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47631: Assignee: Dongjoon Hyun > Remove unused `SQLConf.parquetOutputCommitterClass` method > -- > > Key: SPARK-47631 > URL: https://issues.apache.org/jira/browse/SPARK-47631 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Trivial > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47638) Skip column name validation in PS
Ruifeng Zheng created SPARK-47638: - Summary: Skip column name validation in PS Key: SPARK-47638 URL: https://issues.apache.org/jira/browse/SPARK-47638 Project: Spark Issue Type: Improvement Components: Connect, PS Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47642) Exclude `org.junit.jupiter` and `org.junit.platform` from `jmock-junit5`
[ https://issues.apache.org/jira/browse/SPARK-47642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47642: --- Labels: pull-request-available (was: ) > Exclude `org.junit.jupiter` and `org.junit.platform` from `jmock-junit5` > > > Key: SPARK-47642 > URL: https://issues.apache.org/jira/browse/SPARK-47642 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47644) Refine docstrings of try_*
[ https://issues.apache.org/jira/browse/SPARK-47644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-47644: - Summary: Refine docstrings of try_* (was: Improve docstrings of try_*) > Refine docstrings of try_* > -- > > Key: SPARK-47644 > URL: https://issues.apache.org/jira/browse/SPARK-47644 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major
[jira] [Created] (SPARK-47642) Exclude `junit-jupiter-api` and `org.junit.platform` from `jmock-junit5`
Yang Jie created SPARK-47642: Summary: Exclude `junit-jupiter-api` and `org.junit.platform` from `jmock-junit5` Key: SPARK-47642 URL: https://issues.apache.org/jira/browse/SPARK-47642 Project: Spark Issue Type: Bug Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie
[jira] [Updated] (SPARK-47642) Exclude `org.junit.jupiter` and `org.junit.platform` from `jmock-junit5`
[ https://issues.apache.org/jira/browse/SPARK-47642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-47642: - Summary: Exclude `org.junit.jupiter` and `org.junit.platform` from `jmock-junit5` (was: Exclude `junit-jupiter-api` and `org.junit.platform` from `jmock-junit5`) > Exclude `org.junit.jupiter` and `org.junit.platform` from `jmock-junit5` > > > Key: SPARK-47642 > URL: https://issues.apache.org/jira/browse/SPARK-47642 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major
[jira] [Updated] (SPARK-47643) Add pyspark test for python streaming data source
[ https://issues.apache.org/jira/browse/SPARK-47643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47643: --- Labels: pull-request-available (was: ) > Add pyspark test for python streaming data source > - > > Key: SPARK-47643 > URL: https://issues.apache.org/jira/browse/SPARK-47643 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Chaoqin Li >Priority: Major > Labels: pull-request-available > > Add a PySpark end-to-end test for the Python streaming data source in a pure Python > environment. Currently there are only Scala tests for the Python streaming data > source.
[jira] [Updated] (SPARK-47644) Refine docstrings of try_*
[ https://issues.apache.org/jira/browse/SPARK-47644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47644: --- Labels: pull-request-available (was: ) > Refine docstrings of try_* > -- > > Key: SPARK-47644 > URL: https://issues.apache.org/jira/browse/SPARK-47644 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available
[jira] [Created] (SPARK-47644) Improve docstrings of try_*
Hyukjin Kwon created SPARK-47644: Summary: Improve docstrings of try_* Key: SPARK-47644 URL: https://issues.apache.org/jira/browse/SPARK-47644 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon
[jira] [Resolved] (SPARK-47568) Fix race condition between maintenance thread and task thread for RocksDB snapshot
[ https://issues.apache.org/jira/browse/SPARK-47568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-47568. -- Fix Version/s: 4.0.0 Assignee: Bhuwan Sahni Resolution: Fixed Issue resolved via https://github.com/apache/spark/pull/45724 > Fix race condition between maintenance thread and task thread for RocksDB > snapshot > - > > Key: SPARK-47568 > URL: https://issues.apache.org/jira/browse/SPARK-47568 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2 >Reporter: Bhuwan Sahni >Assignee: Bhuwan Sahni >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > There are currently race conditions between the maintenance thread and the task > thread which can result in corrupted checkpoint state. > # The maintenance thread currently relies on the class variable {{lastSnapshot}} > to find the latest checkpoint and uploads it to DFS. This checkpoint can be > modified at commit time by the task thread if a new snapshot is created. > # The task thread does not reset lastSnapshot at load time, which can result > in newer snapshots (if an old version is loaded) being considered valid and > uploaded to DFS. This results in VersionIdMismatch errors. > This issue proposes to fix both problems by guarding modification of the latestSnapshot > variable, and setting latestSnapshot properly at load time.
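The fix described in SPARK-47568 — guarding modification of the shared snapshot reference and resetting it at load time — can be sketched roughly as below. Class, method, and field names here are illustrative only, not Spark's actual RocksDB state store implementation:

```java
// Sketch of guarding a shared snapshot reference between a task thread
// (which creates/replaces snapshots at commit time) and a maintenance
// thread (which uploads the latest snapshot to DFS). All names are
// hypothetical; the real logic lives in Spark's RocksDB state store.
class SnapshotHolder {
    private Object latestSnapshot = null; // shared between the two threads

    // Task thread: replace the snapshot under the lock at commit time,
    // so the maintenance thread never sees a half-updated reference.
    synchronized void commitSnapshot(Object snapshot) {
        latestSnapshot = snapshot;
    }

    // Task thread: reset at load time so a snapshot left over from a
    // previously loaded (possibly newer) version is never uploaded —
    // this is what prevents the VersionIdMismatch errors.
    synchronized void resetOnLoad() {
        latestSnapshot = null;
    }

    // Maintenance thread: atomically take ownership of the snapshot,
    // so the task thread cannot swap it out mid-upload.
    synchronized Object takeForUpload() {
        Object s = latestSnapshot;
        latestSnapshot = null;
        return s;
    }
}
```

The key design point is that both threads go through the same monitor, and the maintenance thread takes ownership of the snapshot rather than reading a reference the task thread may still mutate.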
[jira] [Assigned] (SPARK-47629) Add `common/variant` and `connector/kinesis-asl` to maven daily test module list
[ https://issues.apache.org/jira/browse/SPARK-47629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-47629: Assignee: Yang Jie > Add `common/variant` and `connector/kinesis-asl` to maven daily test module > list > > > Key: SPARK-47629 > URL: https://issues.apache.org/jira/browse/SPARK-47629 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available
[jira] [Resolved] (SPARK-47629) Add `common/variant` and `connector/kinesis-asl` to maven daily test module list
[ https://issues.apache.org/jira/browse/SPARK-47629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-47629. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45754 [https://github.com/apache/spark/pull/45754] > Add `common/variant` and `connector/kinesis-asl` to maven daily test module > list > > > Key: SPARK-47629 > URL: https://issues.apache.org/jira/browse/SPARK-47629 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0
[jira] [Updated] (SPARK-47635) Use Java 21 instead of 21-jre in K8s Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-47635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47635: --- Labels: pull-request-available (was: ) > Use Java 21 instead of 21-jre in K8s Dockerfile > --- > > Key: SPARK-47635 > URL: https://issues.apache.org/jira/browse/SPARK-47635 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > {code} > $ docker run -it --rm azul/zulu-openjdk:21-jre jmap > docker: Error response from daemon: failed to create task for container: > failed to create shim task: OCI runtime create failed: runc create failed: > unable to start container process: exec: "jmap": executable file not found in > $PATH: unknown. > {code}
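The error in SPARK-47635 stems from `jmap` (and other JDK diagnostic tools) being absent from JRE-only images. Per the ticket title, the proposed change swaps the base image tag; a minimal Dockerfile fragment sketching that swap (the actual Spark K8s Dockerfile and its surrounding content are not shown in this ticket):

```dockerfile
# Before: JRE-only image, which lacks JDK tools such as jmap and jstack
# FROM azul/zulu-openjdk:21-jre

# After: full JDK image, which includes those tools
FROM azul/zulu-openjdk:21
```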