[GitHub] [spark] MaxGekk opened a new pull request, #38027: [WIP][SPARK-40540][SQL] Migrate compilation errors onto error classes: _LEGACY_ERROR_TEMP_1200-1299

2022-09-28 Thread GitBox
MaxGekk opened a new pull request, #38027: URL: https://github.com/apache/spark/pull/38027 ### What changes were proposed in this pull request? In the PR, I propose to migrate 100 compilation errors onto temporary error classes with the prefix `_LEGACY_ERROR_TEMP_12xx`. The error message
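For context, Spark keeps its error classes in a JSON resource (`error-classes.json`) that maps each class name to a message template. A minimal sketch of what one such temporary entry might look like — the field layout follows Spark's convention, but the message text here is purely illustrative:

```json
{
  "_LEGACY_ERROR_TEMP_1200": {
    "message" : [
      "Cannot resolve <objectName> in the given context."
    ]
  }
}
```

Placeholders in angle brackets are substituted with message parameters when the error is raised.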

[GitHub] [spark] LuciferYang opened a new pull request, #38028: [SPARK-40435][SQL][TESTS][FOLLOWUP] Correct test precondition of `PythonUDFSuite` and `ContinuousSuite`

2022-09-28 Thread GitBox
LuciferYang opened a new pull request, #38028: URL: https://github.com/apache/spark/pull/38028 ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/37894 changed the preconditions for the following two tests from

[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing

2022-09-28 Thread GitBox
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r982063973 ## connect/src/main/scala/org/apache/spark/sql/catalyst/connect/connect.scala: ## @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

[GitHub] [spark] itholic commented on a diff in pull request #38015: [SPARK-40577][PS] Fix `CategoricalIndex.append` to match pandas 1.5.0

2022-09-28 Thread GitBox
itholic commented on code in PR #38015: URL: https://github.com/apache/spark/pull/38015#discussion_r982276182 ## python/pyspark/pandas/indexes/base.py: ## @@ -1907,6 +1908,9 @@ def append(self, other: "Index") -> "Index": ) index_fields =


[GitHub] [spark] mridulm commented on pull request #38030: [SPARK-40596][CORE] Populate ExecutorDecommission with messages in ExecutorDecommissionInfo

2022-09-28 Thread GitBox
mridulm commented on PR #38030: URL: https://github.com/apache/spark/pull/38030#issuecomment-1260850742 How/where are we populating the message ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[GitHub] [spark] srowen commented on a diff in pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles

2022-09-28 Thread GitBox
srowen commented on code in PR #38024: URL: https://github.com/apache/spark/pull/38024#discussion_r982356825 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -1078,6 +1078,13 @@ package object config { .booleanConf .createWithDefault(false)

[GitHub] [spark] AngersZhuuuu commented on pull request #35799: [SPARK-38498][STREAM] Support customized StreamingListener by configuration

2022-09-28 Thread GitBox
AngersZh commented on PR #35799: URL: https://github.com/apache/spark/pull/35799#issuecomment-1260559187 @cloud-fan @dongjoon-hyun How about this one? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

[GitHub] [spark] LuciferYang commented on pull request #38028: [SPARK-40435][SQL][TESTS][FOLLOWUP] Correct test precondition of `PythonUDFSuite` and `ContinuousSuite`

2022-09-28 Thread GitBox
LuciferYang commented on PR #38028: URL: https://github.com/apache/spark/pull/38028#issuecomment-1260522666 cc @HeartSaVioR and @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] cloud-fan opened a new pull request, #38029: [SPARK-40595][SQL] Improve error message for unused CTE relations

2022-09-28 Thread GitBox
cloud-fan opened a new pull request, #38029: URL: https://github.com/apache/spark/pull/38029 ### What changes were proposed in this pull request? In `CheckAnalysis`, we inline CTE relations first and then check the plan. This causes an issue if the CTE relation is not used,
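The class of query affected can be sketched as follows: the CTE body contains an error, but because the relation is never referenced, inlining drops it before the check runs. The table name below is hypothetical:

```sql
-- `unused` is never referenced by the main query, so after inlining,
-- the invalid reference inside it could escape CheckAnalysis.
WITH unused AS (SELECT * FROM non_existing_table)
SELECT 1;
```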

[GitHub] [spark] cloud-fan commented on pull request #38029: [SPARK-40595][SQL] Improve error message for unused CTE relations

2022-09-28 Thread GitBox
cloud-fan commented on PR #38029: URL: https://github.com/apache/spark/pull/38029#issuecomment-1260569657 cc @MaxGekk @srielau -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-28 Thread GitBox
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r982215819 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -1098,6 +1106,87 @@ class AstBuilder extends

[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-28 Thread GitBox
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r982352630 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +869,50 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] amaliujia commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing

2022-09-28 Thread GitBox
amaliujia commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r982038430 ## connect/src/main/scala/org/apache/spark/sql/catalyst/connect/connect.scala: ## @@ -0,0 +1,97 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing

2022-09-28 Thread GitBox
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r982067988 ## connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala: ## @@ -0,0 +1,74 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] zhengruifeng closed pull request #38026: [SPARK-40592][PS] Implement `min_count` in `GroupBy.max`

2022-09-28 Thread GitBox
zhengruifeng closed pull request #38026: [SPARK-40592][PS] Implement `min_count` in `GroupBy.max` URL: https://github.com/apache/spark/pull/38026 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] zhengruifeng commented on pull request #38026: [SPARK-40592][PS] Implement `min_count` in `GroupBy.max`

2022-09-28 Thread GitBox
zhengruifeng commented on PR #38026: URL: https://github.com/apache/spark/pull/38026#issuecomment-1260564475 Merged into master, thanks @HyukjinKwon for reviews -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

[GitHub] [spark] itholic commented on a diff in pull request #38031: [SPARK-40589][PS][TEST] Fix test for `DataFrame.corr_with` skip the pandas regression

2022-09-28 Thread GitBox
itholic commented on code in PR #38031: URL: https://github.com/apache/spark/pull/38031#discussion_r982260594 ## python/pyspark/pandas/tests/test_dataframe.py: ## @@ -6076,7 +6076,13 @@ def test_corrwith(self): def _test_corrwith(self, psdf, psobj): pdf =

[GitHub] [spark] itholic opened a new pull request, #38033: [SPARK-40598][PS] Fix plotting features work properly with pandas 1.5.0.

2022-09-28 Thread GitBox
itholic opened a new pull request, #38033: URL: https://github.com/apache/spark/pull/38033 ### What changes were proposed in this pull request? This PR proposes to fix the plotting functions working properly with pandas 1.5.0. This includes two fixes: - Fix the

[GitHub] [spark] cloud-fan commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations

2022-09-28 Thread GitBox
cloud-fan commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r982037071 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/LogicalWriteInfo.java: ## @@ -45,4 +45,18 @@ public interface LogicalWriteInfo { * the schema

[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing

2022-09-28 Thread GitBox
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r982067325 ## connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala: ## @@ -0,0 +1,74 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] mridulm commented on pull request #38030: [SPARK-40596][CORE] Populate ExecutorDecommission with messages in ExecutorDecommissionInfo

2022-09-28 Thread GitBox
mridulm commented on PR #38030: URL: https://github.com/apache/spark/pull/38030#issuecomment-1260855604 Ok, so this is mainly to propagate in `SparkListenerExecutorRemoved`, sounds reasonable. +CC @dongjoon-hyun -- This is an automated message from the Apache Git Service. To

[GitHub] [spark] yaooqinn commented on a diff in pull request #38024: [SPARK-40591][SQL] Fix data loss caused by ignoreCorruptFiles

2022-09-28 Thread GitBox
yaooqinn commented on code in PR #38024: URL: https://github.com/apache/spark/pull/38024#discussion_r982003775 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala: ## @@ -36,8 +36,15 @@ class FilePartitionReader[T]( private

[GitHub] [spark] cloud-fan commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations

2022-09-28 Thread GitBox
cloud-fan commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r982048790 ## sql/catalyst/src/main/scala/org/apache/spark/sql/connector/write/LogicalWriteInfoImpl.scala: ## @@ -23,4 +23,6 @@ import

[GitHub] [spark] HeartSaVioR commented on pull request #38013: [SPARK-40509][SS][PYTHON] Add example for applyInPandasWithState

2022-09-28 Thread GitBox
HeartSaVioR commented on PR #38013: URL: https://github.com/apache/spark/pull/38013#issuecomment-1260507874 Thanks! Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] cloud-fan commented on a diff in pull request #38023: [SPARK-40587][CONNECT] Support SELECT * in an explicit way in connect proto

2022-09-28 Thread GitBox
cloud-fan commented on code in PR #38023: URL: https://github.com/apache/spark/pull/38023#discussion_r982071397 ## connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -96,7 +96,9 @@ class SparkConnectPlanner(plan: proto.Relation,

[GitHub] [spark] LuciferYang commented on pull request #37654: [SPARK-40216][SQL] Extract common `ParquetUtils.prepareWrite` method to deduplicate code in `ParquetFileFormat` and `ParquetWrite`

2022-09-28 Thread GitBox
LuciferYang commented on PR #37654: URL: https://github.com/apache/spark/pull/37654#issuecomment-1260459690 Thanks @cloud-fan @sadikovi -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] cloud-fan commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations

2022-09-28 Thread GitBox
cloud-fan commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r982053631 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DeltaWriter.java: ## @@ -0,0 +1,62 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing

2022-09-28 Thread GitBox
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r982062517 ## connect/src/main/scala/org/apache/spark/sql/catalyst/connect/connect.scala: ## @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

[GitHub] [spark] AngersZhuuuu commented on pull request #35594: [SPARK-38270][SQL] Spark SQL CLI's AM should keep same exit code with client side

2022-09-28 Thread GitBox
AngersZh commented on PR #35594: URL: https://github.com/apache/spark/pull/35594#issuecomment-1260558788 > @AngersZh can you rebase? We should merge this PR. Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub

[GitHub] [spark] LuciferYang commented on pull request #37844: [SPARK-40511][BUILD][CORE] Upgrade slf4j to 2.0.2

2022-09-28 Thread GitBox
LuciferYang commented on PR #37844: URL: https://github.com/apache/spark/pull/37844#issuecomment-1260663996 mvn test `Java8 + hadoop-2 profile` with this pr , all test passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub

[GitHub] [spark] bozhang2820 opened a new pull request, #38030: [SPARK-40596] Populate ExecutorDecommission with messages in ExecutorDecommissionInfo

2022-09-28 Thread GitBox
bozhang2820 opened a new pull request, #38030: URL: https://github.com/apache/spark/pull/38030 ### What changes were proposed in this pull request? This change populates `ExecutorDecommission` with messages in `ExecutorDecommissionInfo`. ### Why are the changes needed?

[GitHub] [spark] yaooqinn opened a new pull request, #38032: [WIP][SPARK-40597][CORE] local mode should respect TASK_MAX_FAILURES like all other cluster managers

2022-09-28 Thread GitBox
yaooqinn opened a new pull request, #38032: URL: https://github.com/apache/spark/pull/38032 ### What changes were proposed in this pull request? The local modes w/o explicitly num of task failures option specified is currently hard coded to 1. The resilience in
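For context, cluster managers govern retries through `spark.task.maxFailures` (default 4), while local mode can express a failure tolerance through its master URL. A sketch of both styles (the application jar path is hypothetical):

```shell
# local[N, F]: N worker threads, up to F task failures tolerated per task
spark-submit --master "local[4,3]" app.jar

# On a cluster manager, the same tolerance comes from the config key
spark-submit --master yarn --conf spark.task.maxFailures=3 app.jar
```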

[GitHub] [spark] HeartSaVioR closed pull request #38013: [SPARK-40509][SS][PYTHON] Add example for applyInPandasWithState

2022-09-28 Thread GitBox
HeartSaVioR closed pull request #38013: [SPARK-40509][SS][PYTHON] Add example for applyInPandasWithState URL: https://github.com/apache/spark/pull/38013 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

[GitHub] [spark] itholic opened a new pull request, #38031: [SPARK-40589][PS][TEST] Fix test for `DataFrame.corr_with` skip the pandas regression

2022-09-28 Thread GitBox
itholic opened a new pull request, #38031: URL: https://github.com/apache/spark/pull/38031 ### What changes were proposed in this pull request? This PR proposes to skip the `DataFrame.corr_with` test when the `other` is `pyspark.pandas.Series` and the `method` is

[GitHub] [spark] mridulm commented on pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles

2022-09-28 Thread GitBox
mridulm commented on PR #38024: URL: https://github.com/apache/spark/pull/38024#issuecomment-1260859431 I am a bit confused with what this PR is trying to do. If we want to ignore corrupt files, by definition failures will be ignored - and tasks will be marked successful : because that

[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-28 Thread GitBox
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r982503768 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -618,6 +618,46 @@ pivotValue : expression (AS? identifier)? ;

[GitHub] [spark] EnricoMi commented on pull request #38036: [SPARK-40601] Assert key size when cogrouping groups

2022-09-28 Thread GitBox
EnricoMi commented on PR #38036: URL: https://github.com/apache/spark/pull/38036#issuecomment-1260959605 Ideally, `EnsureRequirements` should not call into `HashShuffleSpec.createPartitioning(clustering)` with a `clustering` that has an incompatible cardinality. -- This is an automated

[GitHub] [spark] EnricoMi commented on pull request #38036: [SPARK-40601] Assert key size when cogrouping groups

2022-09-28 Thread GitBox
EnricoMi commented on PR #38036: URL: https://github.com/apache/spark/pull/38036#issuecomment-1260961186 @HyukjinKwon @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] peter-toth opened a new pull request, #38038: [WIP][SQL] Refactor BroadcastHashJoinExec output partitioning calculation

2022-09-28 Thread GitBox
peter-toth opened a new pull request, #38038: URL: https://github.com/apache/spark/pull/38038 ### What changes were proposed in this pull request? This is a WIP PR to refactor `BroadcastHashJoinExec` output partitioning calculation. As this PR is based on

[GitHub] [spark] cloud-fan opened a new pull request, #38039: [SPARK-40603][SQL] Throw the original error from catalog implementations

2022-09-28 Thread GitBox
cloud-fan opened a new pull request, #38039: URL: https://github.com/apache/spark/pull/38039 ### What changes were proposed in this pull request? Currently, Spark swallows the error thrown by catalog implementations, and re-throws a standard error. However, the original error
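The gist of the change can be illustrated in a language-neutral way: instead of discarding the underlying exception and raising a generic one, keep the original as the cause. A minimal Python sketch — the class and function names are hypothetical, not Spark's API:

```python
class NoSuchNamespaceError(Exception):
    """Generic error a catalog facade might raise."""


def load_namespace(catalog: dict, name: str):
    try:
        return catalog[name]
    except KeyError as original:
        # Chain the original error instead of swallowing it, so callers
        # can still inspect the implementation-specific root cause.
        raise NoSuchNamespaceError(f"Namespace not found: {name}") from original


try:
    load_namespace({}, "db1")
except NoSuchNamespaceError as e:
    assert isinstance(e.__cause__, KeyError)  # original error preserved
```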

[GitHub] [spark] cloud-fan commented on pull request #38039: [SPARK-40603][SQL] Throw the original error from catalog implementations

2022-09-28 Thread GitBox
cloud-fan commented on PR #38039: URL: https://github.com/apache/spark/pull/38039#issuecomment-1261135792 cc @MaxGekk @srielau @viirya -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] EnricoMi opened a new pull request, #38036: [SPARK-40601] Assert key size when cogrouping groups

2022-09-28 Thread GitBox
EnricoMi opened a new pull request, #38036: URL: https://github.com/apache/spark/pull/38036 Cogrouping two grouped DataFrames in PySpark that have different group key cardinalities raises an error that is not very descriptive: ``` py4j.protocol.Py4JJavaError: An error occurred
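The fix adds an explicit size check so users get a descriptive error up front. A minimal sketch of the idea — the function name is hypothetical, and the real assertion lives on the JVM side:

```python
def check_cogroup_key_sizes(left_keys: list, right_keys: list) -> None:
    # Fail fast with a readable message instead of a deep Py4JJavaError
    # surfacing later from the JVM.
    if len(left_keys) != len(right_keys):
        raise ValueError(
            f"Cogroup keys must have the same size: "
            f"{len(left_keys)} != {len(right_keys)}")


check_cogroup_key_sizes(["id"], ["id"])            # ok, same cardinality
# check_cogroup_key_sizes(["id"], ["id", "day"])   # would raise ValueError
```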

[GitHub] [spark] mridulm commented on pull request #37779: [SPARK-40320][Core] Executor should exit when initialization failed for fatal error

2022-09-28 Thread GitBox
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1260865571 Can we also add the example code you had to reproduce the issue as a test ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-28 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r982381621 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -1098,6 +1106,87 @@ class AstBuilder extends

[GitHub] [spark] peter-toth opened a new pull request, #38034: [SPARK-40599][SQL] Add multiTransform methods to TreeNode to generate alternatives

2022-09-28 Thread GitBox
peter-toth opened a new pull request, #38034: URL: https://github.com/apache/spark/pull/38034 ### What changes were proposed in this pull request? This PR introduce `TreeNode.multiTransform()` methods to be able to recursively transform a `TreeNode` (and so a tree) into multiple
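The idea can be sketched as a recursion that returns every combination of child alternatives. A toy Python version over nested tuples — the real `TreeNode` API is Scala, and the names here are illustrative only:

```python
from itertools import product


def multi_transform(node, rule):
    """rule(node) returns a list of alternatives, or None to recurse."""
    alternatives = rule(node)
    if alternatives is not None:
        return alternatives
    if isinstance(node, tuple):  # treat tuples as inner tree nodes
        child_alternatives = [multi_transform(child, rule) for child in node]
        # Cartesian product: one full tree per combination of child choices
        return list(product(*child_alternatives))
    return [node]  # leaf with no alternatives


# An expression (a + b) where `a` can be rewritten as x or y
trees = multi_transform(("+", "a", "b"),
                        lambda n: ["x", "y"] if n == "a" else None)
assert trees == [("+", "x", "b"), ("+", "y", "b")]
```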

[GitHub] [spark] Kimahriman commented on a diff in pull request #37770: [SPARK-40314][SQL][PYTHON] Add scala and python bindings for inline and inline_outer

2022-09-28 Thread GitBox
Kimahriman commented on code in PR #37770: URL: https://github.com/apache/spark/pull/37770#discussion_r982400522 ## sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala: ## @@ -219,20 +219,21 @@ class GeneratorFunctionSuite extends QueryTest with

[GitHub] [spark] peter-toth opened a new pull request, #38035: [WIP][SQL] Improve constraint generation

2022-09-28 Thread GitBox
peter-toth opened a new pull request, #38035: URL: https://github.com/apache/spark/pull/38035 ### What changes were proposed in this pull request? This is a WIP PR to improve constraint generation with the help of `TreeNode.multiTransform()`. As this PR is based on

[GitHub] [spark] grundprinzip opened a new pull request, #38037: [CONNECT][SPARK-40537] Enable mypy for Spark Connect Python Client

2022-09-28 Thread GitBox
grundprinzip opened a new pull request, #38037: URL: https://github.com/apache/spark/pull/38037 ### What changes were proposed in this pull request? This patch adds the missing type annotations for the Spark Connect Python client and re-enables the mypy checks. In addition, the

[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-28 Thread GitBox
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r982517306 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +869,50 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] grundprinzip commented on a diff in pull request #38023: [SPARK-40587][CONNECT] Support SELECT * in an explicit way in connect proto

2022-09-28 Thread GitBox
grundprinzip commented on code in PR #38023: URL: https://github.com/apache/spark/pull/38023#discussion_r982536156 ## connect/src/main/protobuf/spark/connect/expressions.proto: ## @@ -155,4 +156,7 @@ message Expression { string expression = 1; } + // represent *

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-28 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r982631012 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -1098,6 +1106,87 @@ class AstBuilder extends

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-28 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r982631649 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +869,50 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] amaliujia commented on a diff in pull request #38007: [SPARK-40566][SQL] Add showIndex function

2022-09-28 Thread GitBox
amaliujia commented on code in PR #38007: URL: https://github.com/apache/spark/pull/38007#discussion_r982632002 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -2635,6 +2635,10 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with

[GitHub] [spark] mridulm commented on pull request #38032: [WIP][SPARK-40597][CORE] local mode should respect TASK_MAX_FAILURES like all other cluster managers

2022-09-28 Thread GitBox
mridulm commented on PR #38032: URL: https://github.com/apache/spark/pull/38032#issuecomment-1260862926 The reason to retry on failures is the inherent nature of distributed computation - which does not apply in local mode. What is the scenario where we are looking for this change to be

[GitHub] [spark] peter-toth commented on pull request #38034: [SPARK-40599][SQL] Add multiTransform methods to TreeNode to generate alternatives

2022-09-28 Thread GitBox
peter-toth commented on PR #38034: URL: https://github.com/apache/spark/pull/38034#issuecomment-1261011468 I've opened 3 WIP PRs to demonstrate the usage of `multiTransform()`: - A bug fix to improve AliasAwareOutputPartitioning to take all aliases into account:

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-28 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r982382485 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +869,50 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-28 Thread GitBox
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r982515423 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -1098,6 +1106,87 @@ class AstBuilder extends

[GitHub] [spark] LucaCanali commented on pull request #33559: [SPARK-34265][PYTHON][SQL] Instrument Python UDFs using SQL metrics

2022-09-28 Thread GitBox
LucaCanali commented on PR #33559: URL: https://github.com/apache/spark/pull/33559#issuecomment-1261280991 The issue with SQLQueryTestSuite.udf/postgreSQL/udf-aggregates_part3.sql should be fixed now. I have also extended the instrumentation to applyInPandasWithState recently introduced

[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations

2022-09-28 Thread GitBox
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r982741016 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DeltaWriter.java: ## @@ -0,0 +1,62 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [spark] holdenk commented on a diff in pull request #37885: [SPARK-40428][CORE][WIP] Fix shutdown hook in the CoarseGrainedSchedulerBackend

2022-09-28 Thread GitBox
holdenk commented on code in PR #37885: URL: https://github.com/apache/spark/pull/37885#discussion_r982829203 ## core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala: ## @@ -971,18 +971,30 @@ private[spark] class TaskSchedulerImpl( } override def

[GitHub] [spark] itholic commented on pull request #38018: [SPARK-40580][PS][DOCS] Update the document for `DataFrame.to_orc`.

2022-09-28 Thread GitBox
itholic commented on PR #38018: URL: https://github.com/apache/spark/pull/38018#issuecomment-1261583694 Yeah, and more specifically, "pandas-on-Spark" is used when another noun follows right after "pandas API on Spark". For example, "pandas API on Spark DataFrame is

[GitHub] [spark] amaliujia commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing

2022-09-28 Thread GitBox
amaliujia commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r982725060 ## connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala: ## @@ -0,0 +1,74 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations

2022-09-28 Thread GitBox
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r982723941 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/LogicalWriteInfo.java: ## @@ -45,4 +45,18 @@ public interface LogicalWriteInfo { * the schema

[GitHub] [spark] sadikovi commented on a diff in pull request #37654: [SPARK-40216][SQL] Extract common `ParquetUtils.prepareWrite` method to deduplicate code in `ParquetFileFormat` and `ParquetWrite`

2022-09-28 Thread GitBox
sadikovi commented on code in PR #37654: URL: https://github.com/apache/spark/pull/37654#discussion_r982881638 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala: ## @@ -72,87 +69,9 @@ class ParquetFileFormat job: Job,

[GitHub] [spark] sadikovi commented on a diff in pull request #37654: [SPARK-40216][SQL] Extract common `ParquetUtils.prepareWrite` method to deduplicate code in `ParquetFileFormat` and `ParquetWrite`

2022-09-28 Thread GitBox
sadikovi commented on code in PR #37654: URL: https://github.com/apache/spark/pull/37654#discussion_r982881822 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala: ## @@ -72,87 +69,9 @@ class ParquetFileFormat job: Job,

[GitHub] [spark] amaliujia commented on a diff in pull request #38037: [CONNECT][SPARK-40537] Enable mypy for Spark Connect Python Client

2022-09-28 Thread GitBox
amaliujia commented on code in PR #38037: URL: https://github.com/apache/spark/pull/38037#discussion_r982943887 ## connect/dev/generate_protos.sh: ## @@ -0,0 +1,79 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements.

[GitHub] [spark] amaliujia commented on a diff in pull request #38037: [CONNECT][SPARK-40537] Enable mypy for Spark Connect Python Client

2022-09-28 Thread GitBox
amaliujia commented on code in PR #38037: URL: https://github.com/apache/spark/pull/38037#discussion_r982876885 ## python/pyspark/sql/connect/client.py: ## @@ -21,18 +21,20 @@ import typing import uuid -import grpc +import grpc # type: ignore import pandas import pandas

[GitHub] [spark] grundprinzip commented on a diff in pull request #38037: [CONNECT][SPARK-40537] Enable mypy for Spark Connect Python Client

2022-09-28 Thread GitBox
grundprinzip commented on code in PR #38037: URL: https://github.com/apache/spark/pull/38037#discussion_r982910120 ## python/pyspark/sql/connect/client.py: ## @@ -21,18 +21,20 @@ import typing import uuid -import grpc +import grpc # type: ignore import pandas import

[GitHub] [spark] MaxGekk commented on pull request #38029: [SPARK-40595][SQL] Improve error message for unused CTE relations

2022-09-28 Thread GitBox
MaxGekk commented on PR #38029: URL: https://github.com/apache/spark/pull/38029#issuecomment-1261301966 +1, LGTM. Merging to master. Thank you, @cloud-fan and @amaliujia for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

[GitHub] [spark] MaxGekk closed pull request #38029: [SPARK-40595][SQL] Improve error message for unused CTE relations

2022-09-28 Thread GitBox
MaxGekk closed pull request #38029: [SPARK-40595][SQL] Improve error message for unused CTE relations URL: https://github.com/apache/spark/pull/38029 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[GitHub] [spark] amaliujia commented on a diff in pull request #38037: [CONNECT][SPARK-40537] Enable mypy for Spark Connect Python Client

2022-09-28 Thread GitBox
amaliujia commented on code in PR #38037: URL: https://github.com/apache/spark/pull/38037#discussion_r982912553 ## python/pyspark/sql/connect/client.py: ## @@ -21,18 +21,20 @@ import typing import uuid -import grpc +import grpc # type: ignore import pandas import pandas

[GitHub] [spark] rdblue commented on pull request #36304: [SPARK-38959][SQL] DS V2: Support runtime group filtering in row-level commands

2022-09-28 Thread GitBox
rdblue commented on PR #36304: URL: https://github.com/apache/spark/pull/36304#issuecomment-1261555810 I talked with @aokolnychyi about this and I think this is a data source problem, not something Spark should track right now. The main problem is that some table sources have

[GitHub] [spark] allisonwang-db commented on a diff in pull request #37641: [SPARK-40201][SQL][TESTS] Improve v1 write test coverage

2022-09-28 Thread GitBox
allisonwang-db commented on code in PR #37641: URL: https://github.com/apache/spark/pull/37641#discussion_r982873072 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala: ## @@ -177,7 +171,15 @@ object FileFormatWriter extends Logging {

[GitHub] [spark] dongjoon-hyun commented on pull request #38030: [SPARK-40596][CORE] Populate ExecutorDecommission with messages in ExecutorDecommissionInfo

2022-09-28 Thread GitBox
dongjoon-hyun commented on PR #38030: URL: https://github.com/apache/spark/pull/38030#issuecomment-1261492253 Thank you, @mridulm . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38030: [SPARK-40596][CORE] Populate ExecutorDecommission with messages in ExecutorDecommissionInfo

2022-09-28 Thread GitBox
dongjoon-hyun commented on code in PR #38030: URL: https://github.com/apache/spark/pull/38030#discussion_r982884660 ## core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala: ## @@ -123,7 +123,8 @@ private[spark] object MapStatus { private[spark] class

[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations

2022-09-28 Thread GitBox
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r982731665 ## sql/catalyst/src/main/scala/org/apache/spark/sql/connector/write/LogicalWriteInfoImpl.scala: ## @@ -23,4 +23,6 @@ import

[GitHub] [spark] amaliujia commented on a diff in pull request #38037: [CONNECT][SPARK-40537] Enable mypy for Spark Connect Python Client

2022-09-28 Thread GitBox
amaliujia commented on code in PR #38037: URL: https://github.com/apache/spark/pull/38037#discussion_r982878444 ## python/pyspark/sql/connect/client.py: ## @@ -21,18 +21,20 @@ import typing import uuid -import grpc +import grpc # type: ignore import pandas import pandas

[GitHub] [spark] bozhang2820 commented on a diff in pull request #38030: [SPARK-40596][CORE] Populate ExecutorDecommission with messages in ExecutorDecommissionInfo

2022-09-28 Thread GitBox
bozhang2820 commented on code in PR #38030: URL: https://github.com/apache/spark/pull/38030#discussion_r982969089 ## core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala: ## @@ -123,7 +123,8 @@ private[spark] object MapStatus { private[spark] class

[GitHub] [spark] Ngone51 commented on a diff in pull request #38030: [SPARK-40596][CORE] Populate ExecutorDecommission with messages in ExecutorDecommissionInfo

2022-09-28 Thread GitBox
Ngone51 commented on code in PR #38030: URL: https://github.com/apache/spark/pull/38030#discussion_r983009468 ## core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala: ## @@ -186,6 +186,8 @@ class BlockManagerDecommissionIntegrationSuite

[GitHub] [spark] srowen commented on a diff in pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles

2022-09-28 Thread GitBox
srowen commented on code in PR #38024: URL: https://github.com/apache/spark/pull/38024#discussion_r983008731 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala: ## @@ -36,8 +36,15 @@ class FilePartitionReader[T]( private def
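The `FilePartitionReader` change above concerns the corrupt-file-skipping path. The behavior is gated by a session flag; a hedged configuration sketch (the value shown is illustrative, the default is `false`):

```properties
# When true, Spark skips files whose reads throw instead of failing
# the query. SPARK-40591 tightens this path so that skipping a
# partially-read file cannot silently drop valid rows.
spark.sql.files.ignoreCorruptFiles  true
```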

[GitHub] [spark] zhengruifeng commented on a diff in pull request #37995: [SPARK-40556][PS][SQL] Unpersist the intermediate datasets cached in `AttachDistributedSequenceExec`

2022-09-28 Thread GitBox
zhengruifeng commented on code in PR #37995: URL: https://github.com/apache/spark/pull/37995#discussion_r983016797 ## python/pyspark/pandas/series.py: ## @@ -6442,6 +6445,8 @@ def argmin(self, axis: Axis = None, skipna: bool = True) -> int: raise ValueError("axis

[GitHub] [spark] LuciferYang commented on a diff in pull request #38041: [SPARK-40605][CONNECT] Change to use `log4j2.properties` to configure test log output

2022-09-28 Thread GitBox
LuciferYang commented on code in PR #38041: URL: https://github.com/apache/spark/pull/38041#discussion_r983020028 ## connect/src/test/resources/log4j2.properties: ## @@ -0,0 +1,39 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license
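The file added in this PR follows the `log4j2.properties` layout other Spark test modules use. A minimal hedged sketch of such a test-logging config (appender name and file path are illustrative):

```properties
# Route test log output to a file so console test reports stay clean.
rootLogger.level = info
rootLogger.appenderRef.file.ref = File

appender.file.type = File
appender.file.name = File
appender.file.fileName = target/unit-tests.log
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{HH:mm:ss.SSS} %p %c: %m%n
```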

[GitHub] [spark] LuciferYang commented on pull request #38041: [SPARK-40605][CONNECT] Change to use `log4j2.properties` to configure test log output

2022-09-28 Thread GitBox
LuciferYang commented on PR #38041: URL: https://github.com/apache/spark/pull/38041#issuecomment-1261689871 I'm not sure whether this should belong to the subtask of SPARK-39375. If not, please help move it -- This is an automated message from the Apache Git Service. To respond to the

[GitHub] [spark] wangyum commented on pull request #37996: [SPARK-40558][SQL] Add Reusable Exchange in Bloom creation side plan

2022-09-28 Thread GitBox
wangyum commented on PR #37996: URL: https://github.com/apache/spark/pull/37996#issuecomment-1261697253 Another advantage is that we can coalesce into smaller partitions and then build the bloom filter because large parallelism can not improve the performance of the build bloom filter. For
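The comment's point is that a bloom filter is built by OR-merging per-partition results, so very high parallelism only adds merge steps without speeding up the build. A Spark-free, hedged sketch of that merge structure (single hash function and bitset size are simplifications; real bloom filters use k hashes):

```python
# Build one bitset ("bloom filter") by OR-merging per-partition
# results. Coalescing into fewer, larger partitions yields the same
# final filter with fewer merge steps.
from functools import reduce

M = 1 << 16  # bitset size (illustrative)


def partition_filter(items):
    bits = 0
    for x in items:
        bits |= 1 << (hash(x) % M)  # one hash for brevity
    return bits


def build(partitions):
    # OR-merge the per-partition bitsets into the final filter.
    return reduce(lambda a, b: a | b,
                  (partition_filter(p) for p in partitions), 0)


data = list(range(100))
many = [data[i::10] for i in range(10)]  # high parallelism: 10 partitions
few = [data[:50], data[50:]]             # coalesced: 2 partitions
assert build(many) == build(few)         # same filter either way
```

The partitioning only changes how much merging happens, not the result, which is why coalescing before the build is safe.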

[GitHub] [spark] LuciferYang commented on pull request #37654: [SPARK-40216][SQL] Extract common `ParquetUtils.prepareWrite` method to deduplicate code in `ParquetFileFormat` and `ParquetWrite`

2022-09-28 Thread GitBox
LuciferYang commented on PR #37654: URL: https://github.com/apache/spark/pull/37654#issuecomment-1261594105 Thanks @sadikovi ~ rebase to keep the code up to date, let's wait for GA -- This is an automated message from the Apache Git Service. To respond to the message, please log

[GitHub] [spark] github-actions[bot] closed pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2022-09-28 Thread GitBox
github-actions[bot] closed pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases URL: https://github.com/apache/spark/pull/35549 -- This is an automated message from the Apache Git Service. To respond to the

[GitHub] [spark] github-actions[bot] closed pull request #35548: [SPARK-38234] [SQL] [SS] Added structured streaming monitoring APIs.

2022-09-28 Thread GitBox
github-actions[bot] closed pull request #35548: [SPARK-38234] [SQL] [SS] Added structured streaming monitoring APIs. URL: https://github.com/apache/spark/pull/35548 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

[GitHub] [spark] github-actions[bot] closed pull request #35371: [WIP][SPARK-37946][SQL] Use error classes in the execution errors related to partitions

2022-09-28 Thread GitBox
github-actions[bot] closed pull request #35371: [WIP][SPARK-37946][SQL] Use error classes in the execution errors related to partitions URL: https://github.com/apache/spark/pull/35371 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

[GitHub] [spark] github-actions[bot] closed pull request #35319: [SPARK-36571][SQL] Add new SQLPathHadoopMapReduceCommitProtocol resolve conflict when write into partition table's different partition

2022-09-28 Thread GitBox
github-actions[bot] closed pull request #35319: [SPARK-36571][SQL] Add new SQLPathHadoopMapReduceCommitProtocol resolve conflict when write into partition table's different partition URL: https://github.com/apache/spark/pull/35319 -- This is an automated message from the Apache Git Service.

[GitHub] [spark] github-actions[bot] closed pull request #35337: [SPARK-37840][SQL] Dynamic Update of UDF

2022-09-28 Thread GitBox
github-actions[bot] closed pull request #35337: [SPARK-37840][SQL] Dynamic Update of UDF URL: https://github.com/apache/spark/pull/35337 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] github-actions[bot] closed pull request #34903: [SPARK-37650][PYTHON] Tell spark-env.sh the python interpreter

2022-09-28 Thread GitBox
github-actions[bot] closed pull request #34903: [SPARK-37650][PYTHON] Tell spark-env.sh the python interpreter URL: https://github.com/apache/spark/pull/34903 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

[GitHub] [spark] github-actions[bot] closed pull request #34856: [SPARK-37602][CORE] Add config property to set default Spark listeners

2022-09-28 Thread GitBox
github-actions[bot] closed pull request #34856: [SPARK-37602][CORE] Add config property to set default Spark listeners URL: https://github.com/apache/spark/pull/34856 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

[GitHub] [spark] github-actions[bot] commented on pull request #34791: [SPARK-37528][SQL][CORE] Schedule Tasks By Input Size

2022-09-28 Thread GitBox
github-actions[bot] commented on PR #34791: URL: https://github.com/apache/spark/pull/34791#issuecomment-1261602244 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] github-actions[bot] commented on pull request #34829: [SPARK-23607][CORE] Use HDFS extended attributes to store application summary information in SHS

2022-09-28 Thread GitBox
github-actions[bot] commented on PR #34829: URL: https://github.com/apache/spark/pull/34829#issuecomment-126160 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
