[jira] [Commented] (SPARK-40048) Partitions are traversed multiple times invalidating Accumulator consistency
[ https://issues.apache.org/jira/browse/SPARK-40048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579214#comment-17579214 ] Hyukjin Kwon commented on SPARK-40048:
--------------------------------------

Spark 2.4 is EOL. Mind trying Spark 3.1+?

> Partitions are traversed multiple times invalidating Accumulator consistency
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-40048
>                 URL: https://issues.apache.org/jira/browse/SPARK-40048
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.4
>            Reporter: sam
>            Priority: Major
>
> We are trying to use Accumulators to count RDDs without having to force
> `.count()` on them, for efficiency reasons. We are aware that tasks can fail
> and re-run, which would invalidate the accumulator's value, so we also count
> the number of times each partition has been traversed in order to detect this.
> The problem is that partitions are being traversed multiple times even though:
> - We cache the RDD in memory _after we have applied the logic below_
> - No tasks are failing, no executors are dying.
> - There is plenty of memory (no RDD eviction).
> The code we use (accumulators registered via the SparkContext `sc`):
> ```
> val count: LongAccumulator = sc.longAccumulator("count")
> // one accumulator per partition; the RDD below has 50 partitions
> val partitionTraverseCounts: List[LongAccumulator] =
>   List.fill(50)(sc.longAccumulator("traverseCount"))
>
> def incrementTimesCalled(partitionIndex: Int): Unit =
>   partitionTraverseCounts(partitionIndex).add(1)
>
> def incrementForPartition[T](index: Int, it: Iterator[T]): Iterator[T] = {
>   incrementTimesCalled(index)
>   it.map { x =>
>     count.add(1)
>     x
>   }
> }
> ```
> How we use the above:
> ```
> rdd.mapPartitionsWithIndex(safeCounter.incrementForPartition)
> ```
> We have a 50-partition RDD, and we frequently see odd traverse counts:
> ```
> traverseCounts: List(2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
> 2, 2, 2, 1, 2)
> ```
> As you can see, some partitions are traversed twice, while others are
> traversed only once.
> To confirm no task failures:
> ```
> cat job.log | grep -i task | grep -i fail
> ```
> To confirm no memory issues:
> ```
> cat job.log | grep -i memory
> ```
> Every log line shows multiple GB of memory free. We also don't see any errors
> or exceptions.
> Questions:
> 1. Why is Spark traversing a cached RDD multiple times?
> 2. Is there any way to disable this?
> Many thanks,
> Sam

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
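For reference, a minimal PySpark sketch of the same per-partition traverse-counting pattern (the report's code is Scala; the data, names, and the 50-partition count here are illustrative assumptions):

{code:python}
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(1000), 50)  # 50 partitions, matching the report

# One global row counter plus one traverse counter per partition.
count = sc.accumulator(0)
traverse_counts = [sc.accumulator(0) for _ in range(rdd.getNumPartitions())]

def increment_for_partition(index, it):
    # Record every time this partition's iterator is opened,
    # including any recomputation.
    traverse_counts[index].add(1)
    for x in it:
        count.add(1)
        yield x

cached = rdd.mapPartitionsWithIndex(increment_for_partition).cache()
cached.count()  # force one full evaluation
print("traverseCounts:", [a.value for a in traverse_counts])
{code}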
[jira] [Commented] (SPARK-40063) pyspark.pandas .apply() changing rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579213#comment-17579213 ] Hyukjin Kwon commented on SPARK-40063:
--------------------------------------

{quote}
it ends up mixing the column's rows ordering.
{quote}
Can you show the expected/actual output? What do you mean by "column's rows ordering"?

> pyspark.pandas .apply() changing rows ordering
> ----------------------------------------------
>
>                 Key: SPARK-40063
>                 URL: https://issues.apache.org/jira/browse/SPARK-40063
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 3.3.0
>         Environment: Databricks Runtime 11.1
>            Reporter: Marcelo Rossini Castro
>            Priority: Minor
>              Labels: Pandas, PySpark
>
> When using the apply function to apply a function to a DataFrame column, it
> ends up mixing the column's rows ordering.
> A command like this:
> {code:python}
> def example_func(df_col):
>     return df_col ** 2
>
> df['row_to_apply_function'] = df.apply(lambda row:
>     example_func(row['row_to_apply_function']), axis=1)
> {code}
> A workaround is to assign the results to a new column instead of the same
> one, but if the old column is dropped, the same error is produced.
> Setting one column as index also didn't work.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
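A runnable pandas-on-Spark sketch of the reported pattern; the column name and data are illustrative, and `sort_index()` is shown as one way to re-establish a deterministic row order:

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"col": [1, 2, 3, 4]})

def example_func(v):
    return v ** 2

# The reported pattern: a row-wise apply assigned back to the same column.
psdf["col"] = psdf.apply(lambda row: example_func(row["col"]), axis=1)

# pandas-on-Spark is backed by a distributed DataFrame with no guaranteed
# physical row order; sorting by the index restores a deterministic order.
print(psdf.sort_index())
{code}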
[jira] [Updated] (SPARK-40068) Extend new heartbeat mechanism to YARN
[ https://issues.apache.org/jira/browse/SPARK-40068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40068: - Component/s: YARN (was: Spark Core) > Extend new heartbeat mechanism to YARN > -- > > Key: SPARK-40068 > URL: https://issues.apache.org/jira/browse/SPARK-40068 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.4.0 >Reporter: Kai-Hsun Chen >Priority: Major > > Extend the new heartbeat mechanism in SPARK-39984 to YARN. > > SPARK-39984 issue: > [https://issues.apache.org/jira/projects/SPARK/issues/SPARK-39984?filter=allopenissues] > > SPARK-39984 PR: > [https://github.com/apache/spark/pull/37411] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40069) Extend the new heartbeat mechanism to Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-40069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40069: - Component/s: Kubernetes (was: Spark Core) > Extend the new heartbeat mechanism to Kubernetes > > > Key: SPARK-40069 > URL: https://issues.apache.org/jira/browse/SPARK-40069 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Kai-Hsun Chen >Priority: Major > > Extend the new heartbeat mechanism in SPARK-39984 to Kubernetes. > > SPARK-39984 issue: > [https://issues.apache.org/jira/projects/SPARK/issues/SPARK-39984?filter=allopenissues] > > SPARK-39984 PR: > [https://github.com/apache/spark/pull/37411] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40061) Document cast of ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-40061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40061. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37495 [https://github.com/apache/spark/pull/37495] > Document cast of ANSI intervals > --- > > Key: SPARK-40061 > URL: https://issues.apache.org/jira/browse/SPARK-40061 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Update the doc page > https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#cast > regarding cast of ANSI intervals to/from decimals/integrals. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
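For context, the casts that the updated page documents (ANSI intervals to/from integrals and decimals) can be sketched as follows; the values are illustrative, and the exact semantics are what the doc change spells out:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Year-month interval to and from an integral type (illustrative).
spark.sql("SELECT CAST(INTERVAL '3' YEAR AS INT)").show()
spark.sql("SELECT CAST(3 AS INTERVAL YEAR)").show()

# Day-time interval to a decimal (illustrative).
spark.sql("SELECT CAST(INTERVAL '10.5' SECOND AS DECIMAL(10, 3))").show()
{code}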
[jira] [Created] (SPARK-40069) Extend the new heartbeat mechanism to Kubernetes
Kai-Hsun Chen created SPARK-40069: - Summary: Extend the new heartbeat mechanism to Kubernetes Key: SPARK-40069 URL: https://issues.apache.org/jira/browse/SPARK-40069 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: Kai-Hsun Chen Extend the new heartbeat mechanism in SPARK-39984 to Kubernetes. SPARK-39984 issue: [https://issues.apache.org/jira/projects/SPARK/issues/SPARK-39984?filter=allopenissues] SPARK-39984 PR: [https://github.com/apache/spark/pull/37411] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40067) Add table name to Spark plan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-40067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579209#comment-17579209 ] Apache Spark commented on SPARK-40067: -- User 'sumeetgajjar' has created a pull request for this issue: https://github.com/apache/spark/pull/37505 > Add table name to Spark plan node in SparkUI > > > Key: SPARK-40067 > URL: https://issues.apache.org/jira/browse/SPARK-40067 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.4.0 >Reporter: Sumeet >Priority: Major > > [SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] introduced > the `Scan#name()` API to expose the name of the TableScan in the `BatchScan` node > in SparkUI. > However, a better suggestion was to use `Table#name()`. Furthermore, we > can also extract other useful information from `Table`; thus, revert > [SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] and use > `Table` to fetch the relevant information. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40067) Add table name to Spark plan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-40067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40067: Assignee: (was: Apache Spark) > Add table name to Spark plan node in SparkUI > > > Key: SPARK-40067 > URL: https://issues.apache.org/jira/browse/SPARK-40067 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.4.0 >Reporter: Sumeet >Priority: Major > > [SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] introduced > the `Scan#name()` API to expose the name of the TableScan in the `BatchScan` node > in SparkUI. > However, a better suggestion was to use `Table#name()`. Furthermore, we > can also extract other useful information from `Table`; thus, revert > [SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] and use > `Table` to fetch the relevant information. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40067) Add table name to Spark plan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-40067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40067: Assignee: Apache Spark > Add table name to Spark plan node in SparkUI > > > Key: SPARK-40067 > URL: https://issues.apache.org/jira/browse/SPARK-40067 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.4.0 >Reporter: Sumeet >Assignee: Apache Spark >Priority: Major > > [SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] introduced > the `Scan#name()` API to expose the name of the TableScan in the `BatchScan` node > in SparkUI. > However, a better suggestion was to use `Table#name()`. Furthermore, we > can also extract other useful information from `Table`; thus, revert > [SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] and use > `Table` to fetch the relevant information. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40068) Extend new heartbeat mechanism to YARN
Kai-Hsun Chen created SPARK-40068: - Summary: Extend new heartbeat mechanism to YARN Key: SPARK-40068 URL: https://issues.apache.org/jira/browse/SPARK-40068 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: Kai-Hsun Chen Extend the new heartbeat mechanism in SPARK-39984 to YARN. SPARK-39984 issue: [https://issues.apache.org/jira/projects/SPARK/issues/SPARK-39984?filter=allopenissues] SPARK-39984 PR: [https://github.com/apache/spark/pull/37411] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40067) Add table name to Spark plan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-40067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumeet updated SPARK-40067: --- Description: [SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] introduced the `Scan#name()` API to expose the name of the TableScan in the `BatchScan` node in SparkUI. However, a better suggestion was to use `Table#name()`. Furthermore, we can also extract other useful information from `Table`; thus, revert [SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] and use `Table` to fetch the relevant information. > Add table name to Spark plan node in SparkUI > > > Key: SPARK-40067 > URL: https://issues.apache.org/jira/browse/SPARK-40067 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.4.0 >Reporter: Sumeet >Priority: Major > > [SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] introduced > the `Scan#name()` API to expose the name of the TableScan in the `BatchScan` node > in SparkUI. > However, a better suggestion was to use `Table#name()`. Furthermore, we > can also extract other useful information from `Table`; thus, revert > [SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] and use > `Table` to fetch the relevant information. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40067) Add table name to Spark plan node in SparkUI
Sumeet created SPARK-40067: -- Summary: Add table name to Spark plan node in SparkUI Key: SPARK-40067 URL: https://issues.apache.org/jira/browse/SPARK-40067 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 3.4.0 Reporter: Sumeet -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40065) Executor ConfigMap is not mounted if profile is not default
[ https://issues.apache.org/jira/browse/SPARK-40065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579202#comment-17579202 ] Apache Spark commented on SPARK-40065: -- User 'nsuke' has created a pull request for this issue: https://github.com/apache/spark/pull/37504 > Executor ConfigMap is not mounted if profile is not default > --- > > Key: SPARK-40065 > URL: https://issues.apache.org/jira/browse/SPARK-40065 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Nobuaki Sukegawa >Priority: Minor > > When the executor ConfigMap was made optional in SPARK-34316, volume mounting was > erroneously disabled unconditionally when a non-default profile is used. > When spark.kubernetes.executor.disableConfigMap is false, the expected behavior > is that the ConfigMap is mounted regardless of the executor's resource profile. > However, it is not mounted if the resource profile is non-default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
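A hypothetical session setup showing the setting in question (the builder usage is illustrative; the config key is quoted from the ticket):

{code:python}
from pyspark.sql import SparkSession

# With spark.kubernetes.executor.disableConfigMap left at its default of
# "false", the executor ConfigMap should be mounted for every resource
# profile; the ticket reports it is skipped for non-default profiles.
spark = (SparkSession.builder
         .config("spark.kubernetes.executor.disableConfigMap", "false")
         .getOrCreate())
{code}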
[jira] [Assigned] (SPARK-40065) Executor ConfigMap is not mounted if profile is not default
[ https://issues.apache.org/jira/browse/SPARK-40065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40065: Assignee: (was: Apache Spark) > Executor ConfigMap is not mounted if profile is not default > --- > > Key: SPARK-40065 > URL: https://issues.apache.org/jira/browse/SPARK-40065 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Nobuaki Sukegawa >Priority: Minor > > When the executor ConfigMap was made optional in SPARK-34316, volume mounting was > erroneously disabled unconditionally when a non-default profile is used. > When spark.kubernetes.executor.disableConfigMap is false, the expected behavior > is that the ConfigMap is mounted regardless of the executor's resource profile. > However, it is not mounted if the resource profile is non-default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40065) Executor ConfigMap is not mounted if profile is not default
[ https://issues.apache.org/jira/browse/SPARK-40065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579201#comment-17579201 ] Apache Spark commented on SPARK-40065: -- User 'nsuke' has created a pull request for this issue: https://github.com/apache/spark/pull/37504 > Executor ConfigMap is not mounted if profile is not default > --- > > Key: SPARK-40065 > URL: https://issues.apache.org/jira/browse/SPARK-40065 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Nobuaki Sukegawa >Priority: Minor > > When the executor ConfigMap was made optional in SPARK-34316, volume mounting was > erroneously disabled unconditionally when a non-default profile is used. > When spark.kubernetes.executor.disableConfigMap is false, the expected behavior > is that the ConfigMap is mounted regardless of the executor's resource profile. > However, it is not mounted if the resource profile is non-default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40065) Executor ConfigMap is not mounted if profile is not default
[ https://issues.apache.org/jira/browse/SPARK-40065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40065: Assignee: Apache Spark > Executor ConfigMap is not mounted if profile is not default > --- > > Key: SPARK-40065 > URL: https://issues.apache.org/jira/browse/SPARK-40065 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Nobuaki Sukegawa >Assignee: Apache Spark >Priority: Minor > > When the executor ConfigMap was made optional in SPARK-34316, volume mounting was > erroneously disabled unconditionally when a non-default profile is used. > When spark.kubernetes.executor.disableConfigMap is false, the expected behavior > is that the ConfigMap is mounted regardless of the executor's resource profile. > However, it is not mounted if the resource profile is non-default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40065) Executor ConfigMap is not mounted if profile is not default
[ https://issues.apache.org/jira/browse/SPARK-40065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nobuaki Sukegawa updated SPARK-40065: - Description: When executor config map is made optional in SPARK-34316, mount volume is unconditionally disabled erroneously when non-default profile is used. When spark.kubernetes.executor.disableConfigMap is false, expected behavior is that the ConfigMap is mounted regardless of executor's resource profile. However, it is not mounted if the resource profile is non-default. was: When executor config map is made optional in SPARK-34316, mount volume is unconditionally disabled erroneously when non-default profile is used. When spark.kubernetes.executor.disableConfigMap is false, expected behavior is that the ConfigMap is mounted regardless of executor's resource profile. However, it was not mounted if the resource profile is non-default. > Executor ConfigMap is not mounted if profile is not default > --- > > Key: SPARK-40065 > URL: https://issues.apache.org/jira/browse/SPARK-40065 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Nobuaki Sukegawa >Priority: Minor > > When executor config map is made optional in SPARK-34316, mount volume is > unconditionally disabled erroneously when non-default profile is used. > When spark.kubernetes.executor.disableConfigMap is false, expected behavior > is that the ConfigMap is mounted regardless of executor's resource profile. > However, it is not mounted if the resource profile is non-default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40065) Executor ConfigMap is not mounted if profile is not default
[ https://issues.apache.org/jira/browse/SPARK-40065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nobuaki Sukegawa updated SPARK-40065: - Description: When executor config map is made optional in SPARK-34316, mount volume is unconditionally disabled erroneously when non-default profile is used. When spark.kubernetes.executor.disableConfigMap is false, expected behavior is that the ConfigMap is mounted regardless of executor's resource profile. However, it was not mounted if the resource profile is non-default. was:When the resource profile is non-default, executor configmap is not mounted even if spark.kubernetes.executor.disableConfigMap is false. > Executor ConfigMap is not mounted if profile is not default > --- > > Key: SPARK-40065 > URL: https://issues.apache.org/jira/browse/SPARK-40065 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Nobuaki Sukegawa >Priority: Minor > > When executor config map is made optional in SPARK-34316, mount volume is > unconditionally disabled erroneously when non-default profile is used. > When spark.kubernetes.executor.disableConfigMap is false, expected behavior > is that the ConfigMap is mounted regardless of executor's resource profile. > However, it was not mounted if the resource profile is non-default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40066) ANSI mode: always return null on invalid access to map column
[ https://issues.apache.org/jira/browse/SPARK-40066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40066: Assignee: Gengliang Wang (was: Apache Spark) > ANSI mode: always return null on invalid access to map column > - > > Key: SPARK-40066 > URL: https://issues.apache.org/jira/browse/SPARK-40066 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Since https://github.com/apache/spark/pull/30386, Spark always throws an > error on invalid access to a map column. There is no such syntax in the ANSI > SQL standard since there is no Map type in it. There is a similar type > `multiset`, which returns null on non-existing element access. > Also, I investigated PostgreSQL/Snowflake/BigQuery, and all of them return > null when a map (JSON) key does not exist. > I suggest loosening the syntax here. When users get the error, most of them > will just use `try_element_at()` to get the same behavior or just turn off the > ANSI SQL mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40066) ANSI mode: always return null on invalid access to map column
[ https://issues.apache.org/jira/browse/SPARK-40066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40066: Assignee: Apache Spark (was: Gengliang Wang) > ANSI mode: always return null on invalid access to map column > - > > Key: SPARK-40066 > URL: https://issues.apache.org/jira/browse/SPARK-40066 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Since https://github.com/apache/spark/pull/30386, Spark always throws an > error on invalid access to a map column. There is no such syntax in the ANSI > SQL standard since there is no Map type in it. There is a similar type > `multiset`, which returns null on non-existing element access. > Also, I investigated PostgreSQL/Snowflake/BigQuery, and all of them return > null when a map (JSON) key does not exist. > I suggest loosening the syntax here. When users get the error, most of them > will just use `try_element_at()` to get the same behavior or just turn off the > ANSI SQL mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40066) ANSI mode: always return null on invalid access to map column
[ https://issues.apache.org/jira/browse/SPARK-40066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579200#comment-17579200 ] Apache Spark commented on SPARK-40066: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/37503 > ANSI mode: always return null on invalid access to map column > - > > Key: SPARK-40066 > URL: https://issues.apache.org/jira/browse/SPARK-40066 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Since https://github.com/apache/spark/pull/30386, Spark always throws an > error on invalid access to a map column. There is no such syntax in the ANSI > SQL standard since there is no Map type in it. There is a similar type > `multiset`, which returns null on non-existing element access. > Also, I investigated PostgreSQL/Snowflake/BigQuery, and all of them return > null when a map (JSON) key does not exist. > I suggest loosening the syntax here. When users get the error, most of them > will just use `try_element_at()` to get the same behavior or just turn off the > ANSI SQL mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40066) ANSI mode: always return null on invalid access to map column
[ https://issues.apache.org/jira/browse/SPARK-40066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-40066: --- Description: Since https://github.com/apache/spark/pull/30386, Spark always throws an error on invalid access to a map column. There is no such syntax in the ANSI SQL standard since there is no Map type in it. There is a similar type `multiset`, which returns null on non-existing element access. Also, I investigated PostgreSQL/Snowflake/BigQuery, and all of them return null when a map (JSON) key does not exist. I suggest loosening the syntax here. When users get the error, most of them will just use `try_element_at()` to get the same behavior or just turn off the ANSI SQL mode. > ANSI mode: always return null on invalid access to map column > - > > Key: SPARK-40066 > URL: https://issues.apache.org/jira/browse/SPARK-40066 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Since https://github.com/apache/spark/pull/30386, Spark always throws an > error on invalid access to a map column. There is no such syntax in the ANSI > SQL standard since there is no Map type in it. There is a similar type > `multiset`, which returns null on non-existing element access. > Also, I investigated PostgreSQL/Snowflake/BigQuery, and all of them return > null when a map (JSON) key does not exist. > I suggest loosening the syntax here. When users get the error, most of them > will just use `try_element_at()` to get the same behavior or just turn off the > ANSI SQL mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
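A small sketch of the behavior gap on illustrative data: under ANSI mode, plain access to a missing map key currently raises an error, while `try_element_at()` already returns NULL:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

df = spark.sql("SELECT map('a', 1) AS m")

# Raises an error under ANSI mode before the proposed change:
# df.selectExpr("m['b']").show()

# Returns NULL, the behavior this ticket proposes for plain access as well:
df.selectExpr("try_element_at(m, 'b')").show()
{code}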
[jira] [Updated] (SPARK-40065) Executor ConfigMap is not mounted if profile is not default
[ https://issues.apache.org/jira/browse/SPARK-40065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nobuaki Sukegawa updated SPARK-40065: - Description: When the resource profile is non-default, executor configmap is not mounted even if spark.kubernetes.executor.disableConfigMap is false. (was: When the resource profile is non-default, executor configmap is not created even if spark.kubernetes.executor.disableConfigMap is false.) > Executor ConfigMap is not mounted if profile is not default > --- > > Key: SPARK-40065 > URL: https://issues.apache.org/jira/browse/SPARK-40065 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Nobuaki Sukegawa >Priority: Minor > > When the resource profile is non-default, executor configmap is not mounted > even if spark.kubernetes.executor.disableConfigMap is false. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40065) Executor ConfigMap is not mounted if profile is not default
[ https://issues.apache.org/jira/browse/SPARK-40065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nobuaki Sukegawa updated SPARK-40065: - Summary: Executor ConfigMap is not mounted if profile is not default (was: Executor ConfigMap is not created if profile is not default) > Executor ConfigMap is not mounted if profile is not default > --- > > Key: SPARK-40065 > URL: https://issues.apache.org/jira/browse/SPARK-40065 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Nobuaki Sukegawa >Priority: Minor > > When the resource profile is non-default, executor configmap is not created > even if spark.kubernetes.executor.disableConfigMap is false. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40049) Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite
[ https://issues.apache.org/jira/browse/SPARK-40049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40049. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37500 [https://github.com/apache/spark/pull/37500] > Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite > -- > > Key: SPARK-40049 > URL: https://issues.apache.org/jira/browse/SPARK-40049 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Minor > Fix For: 3.4.0 > > > Currently `ReplaceNullWithFalseInPredicateEndToEndSuite` assumes that > adaptive query execution is turned off. We should add cases with > `spark.sql.adaptive.forceApply=true` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40049) Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite
[ https://issues.apache.org/jira/browse/SPARK-40049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-40049: - Assignee: Kazuyuki Tanimura > Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite > -- > > Key: SPARK-40049 > URL: https://issues.apache.org/jira/browse/SPARK-40049 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Minor > > Currently `ReplaceNullWithFalseInPredicateEndToEndSuite` assumes that > adaptive query execution is turned off. We should add cases with > `spark.sql.adaptive.forceApply=true` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40065) Executor ConfigMap is not created if profile is not default
[ https://issues.apache.org/jira/browse/SPARK-40065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nobuaki Sukegawa updated SPARK-40065: - Affects Version/s: 3.2.1 3.2.0 > Executor ConfigMap is not created if profile is not default > --- > > Key: SPARK-40065 > URL: https://issues.apache.org/jira/browse/SPARK-40065 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Nobuaki Sukegawa >Priority: Minor > > When the resource profile is non-default, executor configmap is not created > even if spark.kubernetes.executor.disableConfigMap is false. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40065) Executor ConfigMap is not created if profile is not default
[ https://issues.apache.org/jira/browse/SPARK-40065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nobuaki Sukegawa updated SPARK-40065: - Affects Version/s: 3.2.2 > Executor ConfigMap is not created if profile is not default > --- > > Key: SPARK-40065 > URL: https://issues.apache.org/jira/browse/SPARK-40065 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0, 3.2.2 >Reporter: Nobuaki Sukegawa >Priority: Minor > > When the resource profile is non-default, executor configmap is not created > even if spark.kubernetes.executor.disableConfigMap is false. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40066) ANSI mode: always return null on invalid access to map column
Gengliang Wang created SPARK-40066: -- Summary: ANSI mode: always return null on invalid access to map column Key: SPARK-40066 URL: https://issues.apache.org/jira/browse/SPARK-40066 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang Assignee: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40037. -- Fix Version/s: 3.4.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/37473 > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Minor > Fix For: 3.4.0 > > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40037: - Priority: Minor (was: Major) > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Minor > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-40037: Assignee: Bjørn Jørgensen > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Fix For: 3.4.0 > > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38969) Graceful decommissioning on Kubernetes fails / decom script error
[ https://issues.apache.org/jira/browse/SPARK-38969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-38969.
----------------------------------
    Fix Version/s: 3.4.0
         Assignee: Holden Karau
       Resolution: Fixed

Updated the decommissioning script to be more resilient and to block as long as it takes for the executor to exit. K8s will still kill the pod if it exceeds the graceful shutdown time-limit, so we don't have to worry too much about blocking forever there.

Also updated how we tag executor loss reasons for executors which decommission too "quickly". See https://github.com/apache/spark/pull/36434/files

> Graceful decommissioning on Kubernetes fails / decom script error
> ------------------------------------------------------------------
>
>                 Key: SPARK-38969
>                 URL: https://issues.apache.org/jira/browse/SPARK-38969
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>         Environment: Running spark-thriftserver (3.2.0) on Kubernetes (GKE 1.20.15-gke.2500).
>
>            Reporter: Yeachan Park
>            Assignee: Holden Karau
>            Priority: Minor
>             Fix For: 3.4.0
>
>
> Hello, we are running into an issue while attempting graceful decommissioning of executors. We enabled:
> * spark.decommission.enabled
> * spark.storage.decommission.rddBlocks.enabled
> * spark.storage.decommission.shuffleBlocks.enabled
> * spark.storage.decommission.enabled
> and set spark.storage.decommission.fallbackStorage.path to a path in our bucket.
>
> The logs from the driver seem to suggest the decommissioning process started but then unexpectedly exited and failed:
>
> ```
> 22/04/20 15:09:09 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 3 decommissioned message
> 22/04/20 15:09:09 INFO KubernetesClusterSchedulerBackend: Decommission executors: 3
> 22/04/20 15:09:09 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(3, 100.96.1.130, 44789, None)) as being decommissioning.
> 22/04/20 15:09:10 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.1.130: Executor decommission.
> 22/04/20 15:09:10 INFO DAGScheduler: Executor lost: 3 (epoch 2)
> 22/04/20 15:09:10 INFO ExecutorMonitor: Executor 3 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 3).
> 22/04/20 15:09:10 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
> 22/04/20 15:09:10 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, 100.96.1.130, 44789, None)
> 22/04/20 15:09:10 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
> 22/04/20 15:09:10 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 2)
> ```
>
> However, the executor logs seem to suggest that decommissioning was successful:
>
> ```
> 22/04/20 15:09:09 INFO CoarseGrainedExecutorBackend: Decommission executor 3.
> 22/04/20 15:09:09 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning
> 22/04/20 15:09:09 INFO BlockManager: Starting block manager decommissioning process...
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting block migration
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all RDD blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all shuffle blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Start refreshing migratable shuffle blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are added. In total, 0 shuffles are remained.
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all cached RDD blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting shuffle block migration thread for BlockManagerId(4, 100.96.1.131, 35607, None)
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting shuffle block migration thread for BlockManagerId(fallback, remote, 7337, None)
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Finished current round refreshing migratable shuffle blocks, waiting for 3ms before the next round refreshing.
> 22/04/20 15:09:10 WARN BlockManagerDecommissioner: Asked to decommission RDD cache blocks, but no blocks to migrate
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Finished current round RDD blocks migration, waiting for 3ms before the next round migration.
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: Checking to see if we can shutdown.
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: No running tasks, checking migrations
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: No running tasks, all blocks migrated, stopping.
> 22/04/20 15:09:10 ERROR
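For context, the decommissioning settings listed in the report above can be combined in a session like this; the config keys are quoted from the ticket, while the fallback path and builder usage are illustrative placeholders:

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.decommission.enabled", "true")
         .config("spark.storage.decommission.enabled", "true")
         .config("spark.storage.decommission.rddBlocks.enabled", "true")
         .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
         # Placeholder path; the reporter pointed this at their own bucket.
         .config("spark.storage.decommission.fallbackStorage.path",
                 "gs://some-bucket/spark-fallback/")
         .getOrCreate())
{code}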
[jira] [Created] (SPARK-40065) Executor ConfigMap is not created if profile is not default
Nobuaki Sukegawa created SPARK-40065: Summary: Executor ConfigMap is not created if profile is not default Key: SPARK-40065 URL: https://issues.apache.org/jira/browse/SPARK-40065 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.3.0 Reporter: Nobuaki Sukegawa When the resource profile is non-default, executor configmap is not created even if spark.kubernetes.executor.disableConfigMap is false. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40052) Handle direct byte buffers in VectorizedDeltaBinaryPackedReader
[ https://issues.apache.org/jira/browse/SPARK-40052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-40052: Assignee: Ivan Sadikov > Handle direct byte buffers in VectorizedDeltaBinaryPackedReader > --- > > Key: SPARK-40052 > URL: https://issues.apache.org/jira/browse/SPARK-40052 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40052) Handle direct byte buffers in VectorizedDeltaBinaryPackedReader
[ https://issues.apache.org/jira/browse/SPARK-40052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-40052. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37485 [https://github.com/apache/spark/pull/37485] > Handle direct byte buffers in VectorizedDeltaBinaryPackedReader > --- > > Key: SPARK-40052 > URL: https://issues.apache.org/jira/browse/SPARK-40052 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40064) Use V2 Filter in SupportsOverwrite
[ https://issues.apache.org/jira/browse/SPARK-40064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40064: Assignee: (was: Apache Spark) > Use V2 Filter in SupportsOverwrite > -- > > Key: SPARK-40064 > URL: https://issues.apache.org/jira/browse/SPARK-40064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Priority: Major > > Add V2 Filter support in SupportsOverwrite -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40064) Use V2 Filter in SupportsOverwrite
[ https://issues.apache.org/jira/browse/SPARK-40064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40064: Assignee: Apache Spark > Use V2 Filter in SupportsOverwrite > -- > > Key: SPARK-40064 > URL: https://issues.apache.org/jira/browse/SPARK-40064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > > Add V2 Filter support in SupportsOverwrite -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40064) Use V2 Filter in SupportsOverwrite
[ https://issues.apache.org/jira/browse/SPARK-40064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579150#comment-17579150 ] Apache Spark commented on SPARK-40064: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/37502 > Use V2 Filter in SupportsOverwrite > -- > > Key: SPARK-40064 > URL: https://issues.apache.org/jira/browse/SPARK-40064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Priority: Major > > Add V2 Filter support in SupportsOverwrite -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40064) Use V2 Filter in SupportsOverwrite
[ https://issues.apache.org/jira/browse/SPARK-40064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579149#comment-17579149 ] Apache Spark commented on SPARK-40064: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/37502 > Use V2 Filter in SupportsOverwrite > -- > > Key: SPARK-40064 > URL: https://issues.apache.org/jira/browse/SPARK-40064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Priority: Major > > Add V2 Filter support in SupportsOverwrite -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40064) Use V2 Filter in SupportsOverwrite
Huaxin Gao created SPARK-40064: -- Summary: Use V2 Filter in SupportsOverwrite Key: SPARK-40064 URL: https://issues.apache.org/jira/browse/SPARK-40064 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Huaxin Gao Add V2 Filter support in SupportsOverwrite -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39528) Use V2 Filter in SupportsRuntimeFiltering
[ https://issues.apache.org/jira/browse/SPARK-39528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-39528: --- Parent: SPARK-36555 Issue Type: Sub-task (was: Improvement) > Use V2 Filter in SupportsRuntimeFiltering > - > > Key: SPARK-39528 > URL: https://issues.apache.org/jira/browse/SPARK-39528 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.4.0 > > > Currently, SupportsRuntimeFiltering uses v1 filter. We should use v2 filter > instead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39966) Use V2 Filter in SupportsDelete
[ https://issues.apache.org/jira/browse/SPARK-39966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-39966: --- Parent: SPARK-36555 Issue Type: Sub-task (was: Improvement) > Use V2 Filter in SupportsDelete > --- > > Key: SPARK-39966 > URL: https://issues.apache.org/jira/browse/SPARK-39966 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.4.0 > > > Spark currently uses V1 Filter in SupportsDelete. Add V2 Filter support. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40063) pyspark.pandas .apply() changing rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Rossini Castro updated SPARK-40063: --- Description: When using the apply function to apply a function to a DataFrame column, it ends up mixing the column's rows ordering. A command like this: {code:java} def example_func(df_col): return df_col ** 2 df['row_to_apply_function'] = df.apply(lambda row: example_func(row['row_to_apply_function']), axis=1) {code} A workaround is to assign the results to a new column instead of the same one, but if the old column is dropped, the same error is produced. Setting one column as index also didn't work. was: When using the apply function to apply a function to a DataFrame column, it ends up mixing the column's rows ordering. A command like this: {code:java} def example_func(df_col): return df_col ** 2 df['row_to_apply_function'] = df.apply(lambda row: example_func(row['row_to_apply_function']), axis=1) {code} A workaround is to assign the results to a new column instead of the same one, but if the old column is dropped, the same error is produced. > pyspark.pandas .apply() changing rows ordering > -- > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 > Environment: Databricks Runtime 11.1 >Reporter: Marcelo Rossini Castro >Priority: Minor > Labels: Pandas, PySpark > > When using the apply function to apply a function to a DataFrame column, it > ends up mixing the column's rows ordering. > A command like this: > {code:java} > def example_func(df_col): > return df_col ** 2 > df['row_to_apply_function'] = df.apply(lambda row: > example_func(row['row_to_apply_function']), axis=1) {code} > A workaround is to assign the results to a new column instead of the same > one, but if the old column is dropped, the same error is produced. > Setting one column as index also didn't work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40063) pyspark.pandas .apply() changing rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Rossini Castro updated SPARK-40063: --- Language: Python Environment: Databricks Runtime 11.1 Labels: Pandas PySpark (was: ) > pyspark.pandas .apply() changing rows ordering > -- > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 > Environment: Databricks Runtime 11.1 >Reporter: Marcelo Rossini Castro >Priority: Minor > Labels: Pandas, PySpark > > When using the apply function to apply a function to a DataFrame column, it > ends up mixing the column's rows ordering. > A command like this: > {code:java} > def example_func(df_col): > return df_col ** 2 > df['row_to_apply_function'] = df.apply(lambda row: > example_func(row['row_to_apply_function']), axis=1) {code} > A workaround is to assign the results to a new column instead of the same > one, but if the old column is dropped, the same error is produced. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39926) Fix bug in existence DEFAULT value lookups for non-vectorized Parquet scans
[ https://issues.apache.org/jira/browse/SPARK-39926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39926:
------------------------------------
    Assignee: (was: Apache Spark)

> Fix bug in existence DEFAULT value lookups for non-vectorized Parquet scans
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-39926
>                 URL: https://issues.apache.org/jira/browse/SPARK-39926
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Daniel
>            Priority: Major
>
> How to reproduce:
> {code:sql}
> set spark.sql.parquet.enableVectorizedReader=false;
> create table t(a int) using parquet;
> insert into t values (42);
> alter table t add column b int default 42;
> insert into t values (43, null);
> select * from t;
> {code}
> This should return two rows:
> (42, 42) and (43, NULL)
> But instead the scan misses the inserted NULL value, and returns the
> existence DEFAULT value of "42" instead:
> (42, 42) and (43, 42).
>
> This bug happens because the Parquet API calls one of these set* methods in
> ParquetRowConverter.scala whenever it finds a non-NULL value:
> {code:scala}
> private class RowUpdater(row: InternalRow, ordinal: Int)
>   extends ParentContainerUpdater {
>   override def set(value: Any): Unit = row(ordinal) = value
>   override def setBoolean(value: Boolean): Unit = row.setBoolean(ordinal, value)
>   override def setByte(value: Byte): Unit = row.setByte(ordinal, value)
>   override def setShort(value: Short): Unit = row.setShort(ordinal, value)
>   override def setInt(value: Int): Unit = row.setInt(ordinal, value)
>   override def setLong(value: Long): Unit = row.setLong(ordinal, value)
>   override def setDouble(value: Double): Unit = row.setDouble(ordinal, value)
>   override def setFloat(value: Float): Unit = row.setFloat(ordinal, value)
> }
> {code}
>
> But it never calls anything like "setNull()" when encountering a NULL value.
> To fix the bug, we need to know how many columns of data were present in each
> row of the Parquet data, so we can differentiate between a NULL value and a
> missing column.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39926) Fix bug in existence DEFAULT value lookups for non-vectorized Parquet scans
[ https://issues.apache.org/jira/browse/SPARK-39926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39926: Assignee: Apache Spark > Fix bug in existence DEFAULT value lookups for non-vectorized Parquet scans > --- > > Key: SPARK-39926 > URL: https://issues.apache.org/jira/browse/SPARK-39926 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Apache Spark >Priority: Major > > How to reproduce: > {code:sql} > set spark.sql.parquet.enableVectorizedReader=false; > create table t(a int) using parquet; > insert into t values (42); > alter table t add column b int default 42; > insert into t values (43, null); > select * from t; > {code} > This should return two rows: > (42, 42) and (43, NULL) > But instead the scan misses the inserted NULL value, and returns the > existence DEFAULT value of "42" instead: > (42, 42) and (43, 42). > > This bug happens because the Parquet API calls one of these set* methods in > ParquetRowConverter.scala whenever it finds a non-NULL value: > {code:scala} > private class RowUpdater(row: InternalRow, ordinal: Int) > extends ParentContainerUpdater { > override def set(value: Any): Unit = row(ordinal) = value > override def setBoolean(value: Boolean): Unit = row.setBoolean(ordinal, > value) > override def setByte(value: Byte): Unit = row.setByte(ordinal, value) > override def setShort(value: Short): Unit = row.setShort(ordinal, value) > override def setInt(value: Int): Unit = row.setInt(ordinal, value) > override def setLong(value: Long): Unit = row.setLong(ordinal, value) > override def setDouble(value: Double): Unit = row.setDouble(ordinal, value) > override def setFloat(value: Float): Unit = row.setFloat(ordinal, value) > } > {code} > > But it never calls anything like "setNull()" when encountering a NULL value. > To fix the bug, we need to know how many columns of data were present in each > row of the Parquet data, so we can differentiate between a NULL value and a > missing column. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39926) Fix bug in existence DEFAULT value lookups for non-vectorized Parquet scans
[ https://issues.apache.org/jira/browse/SPARK-39926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579142#comment-17579142 ] Apache Spark commented on SPARK-39926: -- User 'dtenedor' has created a pull request for this issue: https://github.com/apache/spark/pull/37501 > Fix bug in existence DEFAULT value lookups for non-vectorized Parquet scans > --- > > Key: SPARK-39926 > URL: https://issues.apache.org/jira/browse/SPARK-39926 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > > How to reproduce: > {code:sql} > set spark.sql.parquet.enableVectorizedReader=false; > create table t(a int) using parquet; > insert into t values (42); > alter table t add column b int default 42; > insert into t values (43, null); > select * from t; > {code} > This should return two rows: > (42, 42) and (43, NULL) > But instead the scan misses the inserted NULL value, and returns the > existence DEFAULT value of "42" instead: > (42, 42) and (43, 42). > > This bug happens because the Parquet API calls one of these set* methods in > ParquetRowConverter.scala whenever it finds a non-NULL value: > {code:scala} > private class RowUpdater(row: InternalRow, ordinal: Int) > extends ParentContainerUpdater { > override def set(value: Any): Unit = row(ordinal) = value > override def setBoolean(value: Boolean): Unit = row.setBoolean(ordinal, > value) > override def setByte(value: Byte): Unit = row.setByte(ordinal, value) > override def setShort(value: Short): Unit = row.setShort(ordinal, value) > override def setInt(value: Int): Unit = row.setInt(ordinal, value) > override def setLong(value: Long): Unit = row.setLong(ordinal, value) > override def setDouble(value: Double): Unit = row.setDouble(ordinal, value) > override def setFloat(value: Float): Unit = row.setFloat(ordinal, value) > } > {code} > > But it never calls anything like "setNull()" when encountering a NULL value. > To fix the bug, we need to know how many columns of data were present in each > row of the Parquet data, so we can differentiate between a NULL value and a > missing column. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40063) pyspark.pandas .apply() changing rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Rossini Castro updated SPARK-40063: --- Description: When using the apply function to apply a function to a DataFrame column, it ends up changing the ordering of the column's rows. A command like this: {code:java} def example_func(df_col): return df_col ** 2 df['row_to_apply_function'] = df.apply(lambda row: example_func(row['row_to_apply_function']), axis=1) {code} A workaround is to assign the results to a new column instead of the same one, but if the old column is dropped, the same error is produced. was: When using the apply function to apply a function to a DataFrame column, it ends up changing the ordering of the column's rows. A command like this: {code:java} def example_func(df_col): return df_col ** 2 df['row_to_apply_function'] = df.apply(lambda row: example_func(row['row_to_apply_function']), axis=1) {code} A workaround is to assign the results to a new column instead of the same one, but if the old column is dropped, the same error is produced. > pyspark.pandas .apply() changing rows ordering > -- > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 >Reporter: Marcelo Rossini Castro >Priority: Minor > > When using the apply function to apply a function to a DataFrame column, it > ends up changing the ordering of the column's rows. > A command like this: > {code:java} > def example_func(df_col): > return df_col ** 2 > df['row_to_apply_function'] = df.apply(lambda row: > example_func(row['row_to_apply_function']), axis=1) {code} > A workaround is to assign the results to a new column instead of the same > one, but if the old column is dropped, the same error is produced. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40063) pyspark.pandas .apply() changing rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Rossini Castro updated SPARK-40063: --- Description: When using the apply function to apply a function to a DataFrame column, it ends up changing the ordering of the column's rows. A command like this: {code:java} def example_func(df_col): return df_col ** 2 df['row_to_apply_function'] = df.apply(lambda row: example_func(row['row_to_apply_function']), axis=1) {code} A workaround is to assign the results to a new column instead of the same one, but if the old column is dropped, the same error is produced. was: When using the apply function to apply a function to a DataFrame column, it ends up changing the ordering of the column's rows. A command like this: {code:java} def example_func(df_col): return df_col ** 2 df['row_to_apply_function'] = df.apply(lambda row: example_func(row['row_to_apply_function']), axis=1){code} > pyspark.pandas .apply() changing rows ordering > -- > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 >Reporter: Marcelo Rossini Castro >Priority: Minor > > When using the apply function to apply a function to a DataFrame column, it > ends up changing the ordering of the column's rows. > A command like this: > {code:java} > def example_func(df_col): > return df_col ** 2 > df['row_to_apply_function'] = df.apply(lambda row: > example_func(row['row_to_apply_function']), axis=1) {code} > > A workaround is to assign the results to a new column instead of the same > one, but if the old column is dropped, the same error is produced. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40063) pyspark.pandas .apply() changing rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Rossini Castro updated SPARK-40063: --- Summary: pyspark.pandas .apply() changing rows ordering (was: pyspark.pandas .apply() chaging rows ordering) > pyspark.pandas .apply() changing rows ordering > -- > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 >Reporter: Marcelo Rossini Castro >Priority: Minor > > When using the apply function to apply a function to a DataFrame column, it > ends up changing the ordering of the column's rows. > A command like this: > {code:java} > def example_func(df_col): > return df_col ** 2 > df['row_to_apply_function'] = df.apply(lambda row: > example_func(row['row_to_apply_function']), axis=1){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40063) pyspark.pandas .apply() chaging rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Rossini Castro updated SPARK-40063: --- Description: When using the apply function to apply a function to a DataFrame column, it ends up changing the ordering of the column's rows. A command like this: {code:java} def example_func(df_col): return df_col ** 2 df['row_to_apply_function'] = df.apply(lambda row: example_func(row['row_to_apply_function']), axis=1){code} was: When using the apply function to apply a function to a DataFrame column, it ends up changing the ordering of the column's rows. A command like this: {code:java} df['row_to_apply_function'] = df.apply(lambda row: func(row['row_to_apply_function']), axis=1){code} > pyspark.pandas .apply() chaging rows ordering > - > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 >Reporter: Marcelo Rossini Castro >Priority: Minor > > When using the apply function to apply a function to a DataFrame column, it > ends up changing the ordering of the column's rows. > A command like this: > {code:java} > def example_func(df_col): > return df_col ** 2 > df['row_to_apply_function'] = df.apply(lambda row: > example_func(row['row_to_apply_function']), axis=1){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40063) pyspark.pandas .apply() chaging rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Rossini Castro updated SPARK-40063: --- Description: When using the apply function to apply a function to a DataFrame column, it ends up changing the ordering of the column's rows. A command like this: {code:java} df['row_to_apply_function'] = df.apply(lambda row: func(row['row_to_apply_function']), axis=1){code} > pyspark.pandas .apply() chaging rows ordering > - > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 >Reporter: Marcelo Rossini Castro >Priority: Minor > > When using the apply function to apply a function to a DataFrame column, it > ends up changing the ordering of the column's rows. > > A command like this: > {code:java} > df['row_to_apply_function'] = df.apply(lambda row: > func(row['row_to_apply_function']), axis=1){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40063) pyspark.pandas .apply() chaging rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Rossini Castro updated SPARK-40063: --- Description: When using the apply function to apply a function to a DataFrame column, it ends up changing the ordering of the column's rows. A command like this: {code:java} df['row_to_apply_function'] = df.apply(lambda row: func(row['row_to_apply_function']), axis=1){code} was: When using the apply function to apply a function to a DataFrame column, it ends up changing the ordering of the column's rows. A command like this: {code:java} df['row_to_apply_function'] = df.apply(lambda row: func(row['row_to_apply_function']), axis=1){code} > pyspark.pandas .apply() chaging rows ordering > - > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 >Reporter: Marcelo Rossini Castro >Priority: Minor > > When using the apply function to apply a function to a DataFrame column, it > ends up changing the ordering of the column's rows. > A command like this: > {code:java} > df['row_to_apply_function'] = df.apply(lambda row: > func(row['row_to_apply_function']), axis=1){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40049) Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite
[ https://issues.apache.org/jira/browse/SPARK-40049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40049: Assignee: Apache Spark > Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite > -- > > Key: SPARK-40049 > URL: https://issues.apache.org/jira/browse/SPARK-40049 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Assignee: Apache Spark >Priority: Minor > > Currently `ReplaceNullWithFalseInPredicateEndToEndSuite` assumes that > adaptive query execution is turned off. We should add cases with > `spark.sql.adaptive.forceApply=true` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
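A sketch of what such a case could look like, assuming the suite's usual `SharedSparkSession` and `withSQLConf` test helpers; the test body is illustrative, not the actual patch:
{code:scala}
// Run the same end-to-end assertion with adaptive execution forced on and off.
Seq("false", "true").foreach { forceAqe =>
  test(s"SPARK-40049: null replaced with false, adaptive forceApply=$forceAqe") {
    withSQLConf("spark.sql.adaptive.forceApply" -> forceAqe) {
      // A NULL predicate result must behave like false: only ids 0..5 survive.
      val df = spark.range(10).where("IF(id > 5, null, true)")
      assert(df.count() === 6)
    }
  }
}
{code}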
[jira] [Commented] (SPARK-40049) Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite
[ https://issues.apache.org/jira/browse/SPARK-40049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579138#comment-17579138 ] Apache Spark commented on SPARK-40049: -- User 'kazuyukitanimura' has created a pull request for this issue: https://github.com/apache/spark/pull/37500 > Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite > -- > > Key: SPARK-40049 > URL: https://issues.apache.org/jira/browse/SPARK-40049 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Priority: Minor > > Currently `ReplaceNullWithFalseInPredicateEndToEndSuite` assumes that > adaptive query execution is turned off. We should add cases with > `spark.sql.adaptive.forceApply=true` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40049) Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite
[ https://issues.apache.org/jira/browse/SPARK-40049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40049: Assignee: (was: Apache Spark) > Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite > -- > > Key: SPARK-40049 > URL: https://issues.apache.org/jira/browse/SPARK-40049 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Priority: Minor > > Currently `ReplaceNullWithFalseInPredicateEndToEndSuite` assumes that > adaptive query execution is turned off. We should add cases with > `spark.sql.adaptive.forceApply=true` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40063) pyspark.pandas .apply() chaging rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Rossini Castro updated SPARK-40063: --- Summary: pyspark.pandas .apply() chaging rows ordering (was: pyspark.pandas .apply() chaging rows order) > pyspark.pandas .apply() chaging rows ordering > - > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 >Reporter: Marcelo Rossini Castro >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40063) pyspark.pandas .apply() chaging rows order
Marcelo Rossini Castro created SPARK-40063: -- Summary: pyspark.pandas .apply() chaging rows order Key: SPARK-40063 URL: https://issues.apache.org/jira/browse/SPARK-40063 Project: Spark Issue Type: Bug Components: Pandas API on Spark Affects Versions: 3.3.0 Reporter: Marcelo Rossini Castro -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40062) Spark - Creating Sub Folder while writing to Partitioned Hive Table
dinesh created SPARK-40062: -- Summary: Spark - Creating Sub Folder while writing to Partitioned Hive Table Key: SPARK-40062 URL: https://issues.apache.org/jira/browse/SPARK-40062 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.4.7 Reporter: dinesh We had been writing to a partitioned Hive table and realized that data is being written into an extra sub-folder. For example, refer to the table definition below: _Create table T1 ( name string, address string) Partitioned by (process_date string) stored as parquet location '/mytable/a/b/c/org=employee';_ While writing to the table, the HDFS path being written looks something like this: {_}/mytable/a/b/c/org=employee/{_}{_}process_date=20220812/{_}{color:#de350b}_org=employee_{color} The unnecessary addition of _org=employee_ after the process_date partition happens because the table location contains the "=" character, which Hive uses as syntax to identify partition columns. Re-defining the table without "=" in the location resolves the problem: _Create table T1 ( name string, address string) Partitioned by (process_date string) stored as parquet location '/mytable/a/b/c/employee';_ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40061) Document cast of ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-40061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40061: Assignee: Max Gekk (was: Apache Spark) > Document cast of ANSI intervals > --- > > Key: SPARK-40061 > URL: https://issues.apache.org/jira/browse/SPARK-40061 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Update the doc page > https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#cast > regarding cast of ANSI intervals to/from decimals/integrals. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40061) Document cast of ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-40061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579096#comment-17579096 ] Apache Spark commented on SPARK-40061: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37495 > Document cast of ANSI intervals > --- > > Key: SPARK-40061 > URL: https://issues.apache.org/jira/browse/SPARK-40061 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Update the doc page > https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#cast > regarding cast of ANSI intervals to/from decimals/integrals. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40061) Document cast of ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-40061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40061: Assignee: Apache Spark (was: Max Gekk) > Document cast of ANSI intervals > --- > > Key: SPARK-40061 > URL: https://issues.apache.org/jira/browse/SPARK-40061 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Update the doc page > https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#cast > regarding cast of ANSI intervals to/from decimals/integrals. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40061) Document cast of ANSI intervals
Max Gekk created SPARK-40061: Summary: Document cast of ANSI intervals Key: SPARK-40061 URL: https://issues.apache.org/jira/browse/SPARK-40061 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Update the doc page https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#cast regarding cast of ANSI intervals to/from decimals/integrals. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40058) Avoid filter twice in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-40058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZiyueGuan updated SPARK-40058: -- Affects Version/s: 3.4.0 (was: 3.2.2) > Avoid filter twice in HadoopFSUtils > --- > > Key: SPARK-40058 > URL: https://issues.apache.org/jira/browse/SPARK-40058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: ZiyueGuan >Priority: Minor > > In HadoopFSUtils, listLeafFiles applies the filter more than once in recursive > method calls. This wastes time when the filter logic is heavy. It would be good > to refactor this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
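One plausible shape for the refactor is to apply the filter only at the level where a file is a leaf, so the predicate runs exactly once per file. A generic, self-contained sketch over `java.io.File` under that assumption, not the actual HadoopFSUtils patch:
{code:scala}
import java.io.File

// Each file meets `filter` once, where it is a leaf; results bubbling up
// from recursive calls are not filtered again on the way out.
def listLeafFiles(dir: File, filter: File => Boolean): Seq[File] = {
  val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
  val (dirs, files) = children.partition(_.isDirectory)
  files.filter(filter) ++ dirs.flatMap(listLeafFiles(_, filter))
}
{code}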
[jira] [Assigned] (SPARK-40060) Add numberDecommissioningExecutors metric
[ https://issues.apache.org/jira/browse/SPARK-40060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40060: Assignee: Apache Spark > Add numberDecommissioningExecutors metric > - > > Key: SPARK-40060 > URL: https://issues.apache.org/jira/browse/SPARK-40060 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Assignee: Apache Spark >Priority: Minor > > The number of decommissioning executors should be exposed as a metric -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40060) Add numberDecommissioningExecutors metric
[ https://issues.apache.org/jira/browse/SPARK-40060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40060: Assignee: (was: Apache Spark) > Add numberDecommissioningExecutors metric > - > > Key: SPARK-40060 > URL: https://issues.apache.org/jira/browse/SPARK-40060 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Priority: Minor > > The number of decommissioning executors should be exposed as a metric -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40060) Add numberDecommissioningExecutors metric
[ https://issues.apache.org/jira/browse/SPARK-40060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579058#comment-17579058 ] Apache Spark commented on SPARK-40060: -- User 'warrenzhu25' has created a pull request for this issue: https://github.com/apache/spark/pull/37499 > Add numberDecommissioningExecutors metric > - > > Key: SPARK-40060 > URL: https://issues.apache.org/jira/browse/SPARK-40060 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Priority: Minor > > The number of decommissioning executors should be exposed as a metric -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40060) Add numberDecommissioningExecutors metric
Zhongwei Zhu created SPARK-40060: Summary: Add numberDecommissioningExecutors metric Key: SPARK-40060 URL: https://issues.apache.org/jira/browse/SPARK-40060 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Zhongwei Zhu The number of decommissioning executors should be exposed as a metric -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
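Spark components typically expose such values as Dropwizard gauges. A sketch in that style, where the `numberDecommissioningExecutors` supplier is an assumed hook rather than an existing Spark API:
{code:scala}
import com.codahale.metrics.{Gauge, MetricRegistry}

// Modeled loosely on Spark's metric Source pattern; illustrative only.
class DecommissionMetricsSource(numberDecommissioningExecutors: () => Int) {
  val metricRegistry = new MetricRegistry()
  metricRegistry.register(
    MetricRegistry.name("executors", "numberDecommissioningExecutors"),
    new Gauge[Int] { override def getValue: Int = numberDecommissioningExecutors() })
}
{code}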
[jira] [Resolved] (SPARK-40054) Restore the error handling syntax of try_cast()
[ https://issues.apache.org/jira/browse/SPARK-40054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-40054. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37486 [https://github.com/apache/spark/pull/37486] > Restore the error handling syntax of try_cast() > --- > > Key: SPARK-40054 > URL: https://issues.apache.org/jira/browse/SPARK-40054 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > > > For the following query > {code:java} > SET spark.sql.ansi.enabled=true; > SELECT try_cast(1/0 AS string); {code} > Spark 3.3 throws an exception for the division-by-zero error. In the current > master branch, it returns null after the refactoring PR > https://github.com/apache/spark/pull/36703 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
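A spark-shell sketch of the restored semantics: try_cast should only suppress errors raised by the cast itself, so an ANSI error from evaluating the child expression (here, 1/0) still propagates.
{code:scala}
// Illustrative only; exact error class names may differ.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT try_cast(1/0 AS STRING)").show()
// expected after this fix: a divide-by-zero error, not a row containing NULL
{code}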
[jira] [Commented] (SPARK-40058) Avoid filter twice in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-40058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579047#comment-17579047 ] Apache Spark commented on SPARK-40058: -- User 'guanziyue' has created a pull request for this issue: https://github.com/apache/spark/pull/37498 > Avoid filter twice in HadoopFSUtils > --- > > Key: SPARK-40058 > URL: https://issues.apache.org/jira/browse/SPARK-40058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: ZiyueGuan >Priority: Minor > > In HadoopFSUtils, listLeafFiles applies the filter more than once in recursive > method calls. This wastes time when the filter logic is heavy. It would be good > to refactor this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40058) Avoid filter twice in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-40058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40058: Assignee: (was: Apache Spark) > Avoid filter twice in HadoopFSUtils > --- > > Key: SPARK-40058 > URL: https://issues.apache.org/jira/browse/SPARK-40058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: ZiyueGuan >Priority: Minor > > In HadoopFSUtils, listLeafFiles applies the filter more than once in recursive > method calls. This wastes time when the filter logic is heavy. It would be good > to refactor this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40058) Avoid filter twice in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-40058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40058: Assignee: Apache Spark > Avoid filter twice in HadoopFSUtils > --- > > Key: SPARK-40058 > URL: https://issues.apache.org/jira/browse/SPARK-40058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: ZiyueGuan >Assignee: Apache Spark >Priority: Minor > > In HadoopFSUtils, listLeafFiles applies the filter more than once in recursive > method calls. This wastes time when the filter logic is heavy. It would be good > to refactor this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40058) Avoid filter twice in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-40058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579048#comment-17579048 ] Apache Spark commented on SPARK-40058: -- User 'guanziyue' has created a pull request for this issue: https://github.com/apache/spark/pull/37498 > Avoid filter twice in HadoopFSUtils > --- > > Key: SPARK-40058 > URL: https://issues.apache.org/jira/browse/SPARK-40058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: ZiyueGuan >Priority: Minor > > In HadoopFSUtils, listLeafFiles applies the filter more than once in recursive > method calls. This wastes time when the filter logic is heavy. It would be good > to refactor this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
[ https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-40056: Assignee: BingKun Pan > Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9 > - > > Key: SPARK-40056 > URL: https://issues.apache.org/jira/browse/SPARK-40056 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
[ https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40056: - Priority: Trivial (was: Minor) > Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9 > - > > Key: SPARK-40056 > URL: https://issues.apache.org/jira/browse/SPARK-40056 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
[ https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40056. -- Resolution: Fixed Issue resolved by pull request 37489 [https://github.com/apache/spark/pull/37489] > Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9 > - > > Key: SPARK-40056 > URL: https://issues.apache.org/jira/browse/SPARK-40056 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40020) centralize the code of qualifying identifiers in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-40020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579013#comment-17579013 ] Apache Spark commented on SPARK-40020: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/37497 > centralize the code of qualifying identifiers in SessionCatalog > --- > > Key: SPARK-40020 > URL: https://issues.apache.org/jira/browse/SPARK-40020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40020) centralize the code of qualifying identifiers in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-40020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579014#comment-17579014 ] Apache Spark commented on SPARK-40020: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/37497 > centralize the code of qualifying identifiers in SessionCatalog > --- > > Key: SPARK-40020 > URL: https://issues.apache.org/jira/browse/SPARK-40020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40059) Row indexes can overshadow user-created data
Ala Luszczak created SPARK-40059: Summary: Row indexes can overshadow user-created data Key: SPARK-40059 URL: https://issues.apache.org/jira/browse/SPARK-40059 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Ala Luszczak https://github.com/apache/spark/pull/37228 introduces the ability to compute row indexes, which users can access through the `_metadata.row_index` column. Internally this is achieved with the help of an extra column `_tmp_metadata_row_index`. When present in the schema sent to the parquet reader, the reader populates it with row indexes, and the values are later placed in the `_metadata` struct. While relatively unlikely, it's still possible that a user might want to include the column `_tmp_metadata_row_index` in their data. In such a scenario, the column will be populated with row indexes rather than data read from the file. For repro, search `FileMetadataStructRowIndexSuite.scala` for this Jira ticket number. We could introduce some kind of countermeasure to handle this scenario. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
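For reference, the user-facing side of the feature and the collision it describes, sketched for spark-shell with placeholder paths:
{code:scala}
// Intended use: row indexes come back through the hidden _metadata struct.
spark.read.parquet("/path/to/data").select("_metadata.row_index").show()

// The hazard described above: a user column that happens to use the internal
// name can be overwritten with generated row indexes instead of file data.
spark.range(3).toDF("_tmp_metadata_row_index")
  .write.mode("overwrite").parquet("/tmp/spark40059demo") // placeholder path
spark.read.parquet("/tmp/spark40059demo").show()          // may show indexes, not the data
{code}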
[jira] [Updated] (SPARK-40058) Avoid filter twice in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-40058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZiyueGuan updated SPARK-40058: -- Component/s: Spark Core (was: SQL) > Avoid filter twice in HadoopFSUtils > --- > > Key: SPARK-40058 > URL: https://issues.apache.org/jira/browse/SPARK-40058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: ZiyueGuan >Priority: Minor > > In HadoopFSUtils, listLeafFiles applies the filter more than once in recursive > method calls. This wastes time when the filter logic is heavy. It would be good > to refactor this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40057) Cleanup "" in doctest
[ https://issues.apache.org/jira/browse/SPARK-40057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40057. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37492 [https://github.com/apache/spark/pull/37492] > Cleanup "" in doctest > > > Key: SPARK-40057 > URL: https://issues.apache.org/jira/browse/SPARK-40057 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > https://github.com/apache/spark/pull/37465#discussion_r943080421 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40057) Cleanup "" in doctest
[ https://issues.apache.org/jira/browse/SPARK-40057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40057: Assignee: Yikun Jiang > Cleanup "" in doctest > > > Key: SPARK-40057 > URL: https://issues.apache.org/jira/browse/SPARK-40057 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > > https://github.com/apache/spark/pull/37465#discussion_r943080421 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39887) Expression transform error
[ https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578973#comment-17578973 ] Apache Spark commented on SPARK-39887: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/37496 > Expression transform error > -- > > Key: SPARK-39887 > URL: https://issues.apache.org/jira/browse/SPARK-39887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0, 3.2.2 >Reporter: zhuml >Priority: Major > > {code:java} > spark.sql( > """ > |select to_date(a) a, to_date(b) b from > |(select a, a as b from > |(select to_date(a) a from > | values ('2020-02-01') as t1(a) > | group by to_date(a)) t3 > |union all > |select a, b from > |(select to_date(a) a, to_date(b) b from > |values ('2020-01-01','2020-01-02') as t1(a, b) > | group by to_date(a), to_date(b)) t4) t5 > |group by to_date(a), to_date(b) > |""".stripMargin).show(){code} > result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01) > expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40013) DS V2 expressions should have the default toString
[ https://issues.apache.org/jira/browse/SPARK-40013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40013: Assignee: Apache Spark > DS V2 expressions should have the default toString > -- > > Key: SPARK-40013 > URL: https://issues.apache.org/jira/browse/SPARK-40013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Currently, V2 expressions are missing a default toString, which leads to unexpected > results. > We should add a default implementation in the base class Expression using > ToStringSQLBuilder. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40013) DS V2 expressions should have the default toString
[ https://issues.apache.org/jira/browse/SPARK-40013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40013: Assignee: (was: Apache Spark) > DS V2 expressions should have the default toString > -- > > Key: SPARK-40013 > URL: https://issues.apache.org/jira/browse/SPARK-40013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, V2 expressions are missing a default toString, which leads to unexpected > results. > We should add a default implementation in the base class Expression using > ToStringSQLBuilder. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40013) DS V2 expressions should have the default toString
[ https://issues.apache.org/jira/browse/SPARK-40013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40013: Assignee: Apache Spark > DS V2 expressions should have the default toString > -- > > Key: SPARK-40013 > URL: https://issues.apache.org/jira/browse/SPARK-40013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Currently, V2 expressions are missing a default toString, which leads to unexpected > results. > We should add a default implementation in the base class Expression using > ToStringSQLBuilder. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40013) DS V2 expressions should have the default toString
[ https://issues.apache.org/jira/browse/SPARK-40013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578936#comment-17578936 ] Apache Spark commented on SPARK-40013: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37494 > DS V2 expressions should have the default toString > -- > > Key: SPARK-40013 > URL: https://issues.apache.org/jira/browse/SPARK-40013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, V2 expressions are missing a default toString, which leads to unexpected > results. > We should add a default implementation in the base class Expression using > ToStringSQLBuilder. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40013) DS V2 expressions should have the default toString
[ https://issues.apache.org/jira/browse/SPARK-40013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-40013: --- Summary: DS V2 expressions should have the default toString (was: DS V2 expressions should have the default implementation of toString) > DS V2 expressions should have the default toString > -- > > Key: SPARK-40013 > URL: https://issues.apache.org/jira/browse/SPARK-40013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, V2 expressions are missing a default toString, which leads to unexpected > results. > We should add a default implementation in the base class Expression using > ToStringSQLBuilder. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-40013) DS V2 expressions should have the default implementation of toString
[ https://issues.apache.org/jira/browse/SPARK-40013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng reopened SPARK-40013: > DS V2 expressions should have the default implementation of toString > > > Key: SPARK-40013 > URL: https://issues.apache.org/jira/browse/SPARK-40013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, V2 expressions are missing a default toString, which leads to unexpected > results. > We should add a default implementation in the base class Expression using > ToStringSQLBuilder. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40014) Support cast of decimals to ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-40014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40014. -- Resolution: Fixed Issue resolved by pull request 37466 [https://github.com/apache/spark/pull/37466] > Support cast of decimals to ANSI intervals > -- > > Key: SPARK-40014 > URL: https://issues.apache.org/jira/browse/SPARK-40014 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Support casts of decimal to ANSI intervals, and preserve the fractional parts > of seconds in the casts. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
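Illustrative spark-shell examples of the casts in question, under the assumption that ANSI interval type names are usable as CAST targets in Spark 3.4; the exact rendering of results is approximate:
{code:scala}
// Decimal -> ANSI interval, keeping the fractional part of the seconds.
spark.sql("SELECT CAST(1.5 AS INTERVAL SECOND)").show(false)
// roughly: an interval of 1.5 seconds

// The reverse direction, covered by the documentation ticket SPARK-40061.
spark.sql("SELECT CAST(INTERVAL '10.123' SECOND AS DECIMAL(10, 3))").show()
// roughly: 10.123
{code}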
[jira] [Assigned] (SPARK-40014) Support cast of decimals to ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-40014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40014: Assignee: Max Gekk > Support cast of decimals to ANSI intervals > -- > > Key: SPARK-40014 > URL: https://issues.apache.org/jira/browse/SPARK-40014 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Support casts of decimal to ANSI intervals, and preserve the fractional parts > of seconds in the casts. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40058) Avoid filter twice in HadoopFSUtils
ZiyueGuan created SPARK-40058: - Summary: Avoid filter twice in HadoopFSUtils Key: SPARK-40058 URL: https://issues.apache.org/jira/browse/SPARK-40058 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.2 Reporter: ZiyueGuan In HadoopFSUtils, listLeafFiles applies the filter more than once in recursive method calls. This wastes time when the filter logic is heavy. It would be good to refactor this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)
[ https://issues.apache.org/jira/browse/SPARK-37690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578901#comment-17578901 ] Daniel Darabos commented on SPARK-37690: It's fixed in Spark 3.3.0. (https://github.com/apache/spark/commit/1d068cef38f2323967be83045118cef0e537e8dc) Does upgrading count as a workaround? Or on 3.2 you can avoid the cycle error by saving the new table under a new name. > Recursive view `df` detected (cycle: `df` -> `df`) > -- > > Key: SPARK-37690 > URL: https://issues.apache.org/jira/browse/SPARK-37690 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Robin >Priority: Major > > In Spark 3.2.0, you can no longer reuse the same name for a temporary view. > This change is backwards incompatible, and means a common way of running > pipelines of SQL queries no longer works. The following is a simple > reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: > {code:python}from pyspark.context import SparkContext > from pyspark.sql import SparkSession > sc = SparkContext.getOrCreate() > spark = SparkSession(sc) > sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """ > df = spark.sql(sql) > df.createOrReplaceTempView("df") > sql = """ SELECT * FROM df """ > df = spark.sql(sql) > df.createOrReplaceTempView("df") > sql = """ SELECT * FROM df """ > df = spark.sql(sql) {code} > The following error is now produced: > {code:python}AnalysisException: Recursive view `df` detected (cycle: `df` -> > `df`) > {code} > I'm reasonably sure this change is unintentional in 3.2.0 since it breaks a > lot of legacy code, and the `createOrReplaceTempView` method is named > explicitly such that replacing an existing view should be allowed. An > internet search suggests other users have run into a similar problems, e.g. > [here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
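The second workaround, sketched in Scala for spark-shell (the original report is PySpark, and the per-stage view names here are made up): give each pipeline stage its own view name so no view is redefined in terms of itself.
{code:scala}
// On Spark 3.2, avoid createOrReplaceTempView("df") over a query that itself
// reads "df"; distinct per-stage names sidestep the recursive-view check.
spark.sql("SELECT id AS col_1, rand() AS col_2 FROM RANGE(10)")
  .createOrReplaceTempView("df_stage1")
spark.sql("SELECT * FROM df_stage1").createOrReplaceTempView("df_stage2")
spark.sql("SELECT * FROM df_stage2").show()
{code}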
[jira] [Assigned] (SPARK-40057) Cleanup "" in doctest
[ https://issues.apache.org/jira/browse/SPARK-40057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40057: Assignee: Apache Spark > Cleanup "" in doctest > > > Key: SPARK-40057 > URL: https://issues.apache.org/jira/browse/SPARK-40057 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > > https://github.com/apache/spark/pull/37465#discussion_r943080421 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40057) Cleanup "" in doctest
[ https://issues.apache.org/jira/browse/SPARK-40057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578886#comment-17578886 ] Apache Spark commented on SPARK-40057: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37492 > Cleanup "" in doctest > > > Key: SPARK-40057 > URL: https://issues.apache.org/jira/browse/SPARK-40057 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > https://github.com/apache/spark/pull/37465#discussion_r943080421 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40057) Cleanup "" in doctest
[ https://issues.apache.org/jira/browse/SPARK-40057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578885#comment-17578885 ] Apache Spark commented on SPARK-40057: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37492 > Cleanup "" in doctest > > > Key: SPARK-40057 > URL: https://issues.apache.org/jira/browse/SPARK-40057 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > https://github.com/apache/spark/pull/37465#discussion_r943080421 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40057) Cleanup "" in doctest
[ https://issues.apache.org/jira/browse/SPARK-40057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40057: Assignee: (was: Apache Spark) > Cleanup "" in doctest > > > Key: SPARK-40057 > URL: https://issues.apache.org/jira/browse/SPARK-40057 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > https://github.com/apache/spark/pull/37465#discussion_r943080421 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40057) Cleanup "" in doctest
Yikun Jiang created SPARK-40057: --- Summary: Cleanup "" in doctest Key: SPARK-40057 URL: https://issues.apache.org/jira/browse/SPARK-40057 Project: Spark Issue Type: Bug Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Yikun Jiang https://github.com/apache/spark/pull/37465#discussion_r943080421 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39887) Expression transform error
[ https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578855#comment-17578855 ] Apache Spark commented on SPARK-39887: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/37491 > Expression transform error > -- > > Key: SPARK-39887 > URL: https://issues.apache.org/jira/browse/SPARK-39887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0, 3.2.2 >Reporter: zhuml >Priority: Major > > {code:java} > spark.sql( > """ > |select to_date(a) a, to_date(b) b from > |(select a, a as b from > |(select to_date(a) a from > | values ('2020-02-01') as t1(a) > | group by to_date(a)) t3 > |union all > |select a, b from > |(select to_date(a) a, to_date(b) b from > |values ('2020-01-01','2020-01-02') as t1(a, b) > | group by to_date(a), to_date(b)) t4) t5 > |group by to_date(a), to_date(b) > |""".stripMargin).show(){code} > result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01) > expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org