[GitHub] [spark] AmplabJenkins commented on pull request #36548: [SPARK-38470][CORE] Use error classes in org.apache.spark.partial

2022-05-14 Thread GitBox
AmplabJenkins commented on PR #36548: URL: https://github.com/apache/spark/pull/36548#issuecomment-1126812505 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[GitHub] [spark] huaxingao commented on pull request #36521: [SPARK-39162][SQL] Jdbc dialect should decide which function could be pushed down

2022-05-14 Thread GitBox
huaxingao commented on PR #36521: URL: https://github.com/apache/spark/pull/36521#issuecomment-1126826707 Merged to master. Thanks! @beliefer I can't merge to 3.3 because there are conflicts. Could you please back port to 3.3?

[GitHub] [spark] AmplabJenkins commented on pull request #36544: [SPARK-39183][BUILD] Upgrade Apache Xerces Java to 2.12.2

2022-05-14 Thread GitBox
AmplabJenkins commented on PR #36544: URL: https://github.com/apache/spark/pull/36544#issuecomment-1126831264 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-14 Thread GitBox
AmplabJenkins commented on PR #36545: URL: https://github.com/apache/spark/pull/36545#issuecomment-1126831255 Can one of the admins verify this patch?

[GitHub] [spark] HyukjinKwon commented on pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

2022-05-14 Thread GitBox
HyukjinKwon commented on PR #36501: URL: https://github.com/apache/spark/pull/36501#issuecomment-1126833206 Test results are in https://github.com/dtenedor/spark/runs/6433699950. Seems like the sync failed for some reason.

[GitHub] [spark] HyukjinKwon commented on pull request #36549: [SPARK-39186][PYTHON] Make pandas-on-Spark's skew consistent with pandas

2022-05-14 Thread GitBox
HyukjinKwon commented on PR #36549: URL: https://github.com/apache/spark/pull/36549#issuecomment-1126832926 Merged to master, branch-3.3 and branch-3.2.

[GitHub] [spark] zhengruifeng commented on pull request #36554: [SPARK-39186][PYTHON][FOLLOWUP] Improve the numerical stability of pandas-on-Spark's skewness

2022-05-14 Thread GitBox
zhengruifeng commented on PR #36554: URL: https://github.com/apache/spark/pull/36554#issuecomment-1126864262 cc @HyukjinKwon

[GitHub] [spark] github-actions[bot] commented on pull request #35357: [SPARK-21195][CORE] MetricSystem should pick up dynamically registered metrics in sources

2022-05-14 Thread GitBox
github-actions[bot] commented on PR #35357: URL: https://github.com/apache/spark/pull/35357#issuecomment-1126832112 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] HyukjinKwon commented on pull request #36267: [SPARK-38953][PYTHON][DOC] Document PySpark common exceptions / errors

2022-05-14 Thread GitBox
HyukjinKwon commented on PR #36267: URL: https://github.com/apache/spark/pull/36267#issuecomment-1126832061 Merged to master and branch-3.3.

[GitHub] [spark] HyukjinKwon closed pull request #36267: [SPARK-38953][PYTHON][DOC] Document PySpark common exceptions / errors

2022-05-14 Thread GitBox
HyukjinKwon closed pull request #36267: [SPARK-38953][PYTHON][DOC] Document PySpark common exceptions / errors URL: https://github.com/apache/spark/pull/36267

[GitHub] [spark] HyukjinKwon closed pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences

2022-05-14 Thread GitBox
HyukjinKwon closed pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences URL: https://github.com/apache/spark/pull/36546

[GitHub] [spark] huaxingao closed pull request #36521: [SPARK-39162][SQL] Jdbc dialect should decide which function could be pushed down

2022-05-14 Thread GitBox
huaxingao closed pull request #36521: [SPARK-39162][SQL] Jdbc dialect should decide which function could be pushed down URL: https://github.com/apache/spark/pull/36521

[GitHub] [spark] MaxGekk opened a new pull request, #36553: [WIP][SQL] Improve errors related to casts

2022-05-14 Thread GitBox
MaxGekk opened a new pull request, #36553: URL: https://github.com/apache/spark/pull/36553 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

[GitHub] [spark] HyukjinKwon commented on pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences

2022-05-14 Thread GitBox
HyukjinKwon commented on PR #36546: URL: https://github.com/apache/spark/pull/36546#issuecomment-1126832443 @bersprockets, it has a conflict with branch-3.1. Please create a PR to backport if you think it should be backported :-).

[GitHub] [spark] HyukjinKwon commented on pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences

2022-05-14 Thread GitBox
HyukjinKwon commented on PR #36546: URL: https://github.com/apache/spark/pull/36546#issuecomment-1126832378 Merged to master, branch-3.3 and branch-3.2.

[GitHub] [spark] Yikun commented on a diff in pull request #36464: [SPARK-38947][PYTHON] Supports groupby positional indexing

2022-05-14 Thread GitBox
Yikun commented on code in PR #36464: URL: https://github.com/apache/spark/pull/36464#discussion_r873100372 ## python/pyspark/pandas/groupby.py: ## @@ -2110,22 +2110,79 @@ def _limit(self, n: int, asc: bool) -> FrameLike: groupkey_scols =

[GitHub] [spark] zhengruifeng commented on pull request #36554: [SPARK-39186][PYTHON][FOLLOWUP] Improve the numerical stability of skewness

2022-05-14 Thread GitBox
zhengruifeng commented on PR #36554: URL: https://github.com/apache/spark/pull/36554#issuecomment-1126847198 before this PR: ``` In [2]: pdf = pd.DataFrame( ...: { ...: "A": [1, 1, 1, 1, 1], ...: "B": [1.0,

[GitHub] [spark] zhengruifeng opened a new pull request, #36554: [SPARK-39186][PYTHON][FOLLOWUP] Improve the numerical stability of skewness

2022-05-14 Thread GitBox
zhengruifeng opened a new pull request, #36554: URL: https://github.com/apache/spark/pull/36554 ### What changes were proposed in this pull request? Improve the numerical stability of skewness for cases with small `m2` and `m3` ### Why are the changes needed? the

[GitHub] [spark] HyukjinKwon closed pull request #36549: [SPARK-39186][PYTHON] Make pandas-on-Spark's skew consistent with pandas

2022-05-14 Thread GitBox
HyukjinKwon closed pull request #36549: [SPARK-39186][PYTHON] Make pandas-on-Spark's skew consistent with pandas URL: https://github.com/apache/spark/pull/36549

[GitHub] [spark] AmplabJenkins commented on pull request #36540: [SPARK-38466][CORE] Use error classes in org.apache.spark.mapred

2022-05-14 Thread GitBox
AmplabJenkins commented on PR #36540: URL: https://github.com/apache/spark/pull/36540#issuecomment-1126840371 Can one of the admins verify this patch?

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-14 Thread GitBox
HyukjinKwon commented on code in PR #36545: URL: https://github.com/apache/spark/pull/36545#discussion_r872942079 ## python/pyspark/sql/tests/test_types.py: ## @@ -285,6 +285,64 @@ def test_infer_nested_dict_as_struct(self): df = self.spark.createDataFrame(data)

[GitHub] [spark] zhengruifeng commented on pull request #36549: [SPARK-39186][PYTHON] make skew consistent with pandas

2022-05-14 Thread GitBox
zhengruifeng commented on PR #36549: URL: https://github.com/apache/spark/pull/36549#issuecomment-1126676744 latest master: ``` pdf = pd.DataFrame( { "A": [1, -2, np.nan, -4, 5], "B": [1.0, -2, np.nan, -4, 5], "C": [-6.0, -7, -8, np.nan,

[GitHub] [spark] zhengruifeng opened a new pull request, #36549: [SPARK-39186][PYTHON] make skew consistent with pandas

2022-05-14 Thread GitBox
zhengruifeng opened a new pull request, #36549: URL: https://github.com/apache/spark/pull/36549 ### What changes were proposed in this pull request? The logic for computing skewness differs between Spark SQL and pandas: Spark SQL: [`sqrt(n) * m3 / sqrt(m2 * m2 *

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

2022-05-14 Thread GitBox
HyukjinKwon commented on code in PR #36501: URL: https://github.com/apache/spark/pull/36501#discussion_r872943310 ## sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala: ## @@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends

[GitHub] [spark] bjornjorgensen commented on pull request #36547: Implement `skipna` parameter of `Groupby.all`

2022-05-14 Thread GitBox
bjornjorgensen commented on PR #36547: URL: https://github.com/apache/spark/pull/36547#issuecomment-1126676101 [all](https://docs.python.org/3/library/functions.html#all) is a built-in function in Python. Can we rename this to `def all_to_skip()`?

[GitHub] [spark] Yikun commented on a diff in pull request #36464: [SPARK-38947][PYTHON] Supports groupby positional indexing

2022-05-14 Thread GitBox
Yikun commented on code in PR #36464: URL: https://github.com/apache/spark/pull/36464#discussion_r872956648 ## python/pyspark/pandas/groupby.py: ## @@ -2110,22 +2110,79 @@ def _limit(self, n: int, asc: bool) -> FrameLike: groupkey_scols =

[GitHub] [spark] abhishekd0907 commented on pull request #35683: [SPARK-30835][SPARK-39018][CORE][YARN] Add support for YARN decommissioning when ESS is disabled

2022-05-14 Thread GitBox
abhishekd0907 commented on PR #35683: URL: https://github.com/apache/spark/pull/35683#issuecomment-1126682797 @mridulm @attilapiros I have handled all your comments. Can you please review the PR again?

[GitHub] [spark] MaxGekk opened a new pull request, #36550: [SPARK-39187][SQL] Remove `SparkIllegalStateException`

2022-05-14 Thread GitBox
MaxGekk opened a new pull request, #36550: URL: https://github.com/apache/spark/pull/36550 ### What changes were proposed in this pull request? Remove `SparkIllegalStateException` and replace it by `IllegalStateException` where it was used. ### Why are the changes needed? To

[GitHub] [spark] MaxGekk commented on a diff in pull request #36550: [SPARK-39187][SQL] Remove `SparkIllegalStateException`

2022-05-14 Thread GitBox
MaxGekk commented on code in PR #36550: URL: https://github.com/apache/spark/pull/36550#discussion_r872960720 ## sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala: ## @@ -272,15 +271,6 @@ class QueryExecutionErrorsSuite } } -

[GitHub] [spark] HyukjinKwon commented on pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-14 Thread GitBox
HyukjinKwon commented on PR #36545: URL: https://github.com/apache/spark/pull/36545#issuecomment-1126657800 We should probably add a configuration like `spark.sql.pyspark.legacy.inferFirstElementInArray.enabled` (feel free to pick other names if you have other ideas).

[GitHub] [spark] beliefer commented on pull request #36516: [SPARK-39157][SQL] H2Dialect should override getJDBCType so as make the data type is correct

2022-05-14 Thread GitBox
beliefer commented on PR #36516: URL: https://github.com/apache/spark/pull/36516#issuecomment-1126682083 @cloud-fan Thank you!

[GitHub] [spark] zhengruifeng commented on pull request #36549: [SPARK-39186][PYTHON] make skew consistent with pandas

2022-05-14 Thread GitBox
zhengruifeng commented on PR #36549: URL: https://github.com/apache/spark/pull/36549#issuecomment-1126688194 cc @HyukjinKwon @xinrong-databricks @itholic should this be a bug-fix?

[GitHub] [spark] HyukjinKwon commented on pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-14 Thread GitBox
HyukjinKwon commented on PR #36545: URL: https://github.com/apache/spark/pull/36545#issuecomment-1126657160 Nice PR description. Yeah, we should probably add a configuration then, please also refer to https://github.com/apache/spark/commit/2537fe8cbaf49070137d4b5bc39af078b306c4c8 for

[GitHub] [spark] beliefer commented on a diff in pull request #36531: [SPARK-39171][SQL] Unify the Cast expression

2022-05-14 Thread GitBox
beliefer commented on code in PR #36531: URL: https://github.com/apache/spark/pull/36531#discussion_r872962001 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -2117,7 +2265,9 @@ case class Cast( child: Expression, dataType:

[GitHub] [spark] panbingkun opened a new pull request, #36551: [SPARK-38463][CORE] Use error classes in org.apache.spark.input

2022-05-14 Thread GitBox
panbingkun opened a new pull request, #36551: URL: https://github.com/apache/spark/pull/36551 ## What changes were proposed in this pull request? This change is to refactor exceptions thrown in FixedLengthBinaryRecordReader to use error class framework. ### Why are the changes

[GitHub] [spark] panbingkun commented on pull request #36479: [SPARK-38688][SQL][TESTS] Use error classes in the compilation errors of deserializer

2022-05-14 Thread GitBox
panbingkun commented on PR #36479: URL: https://github.com/apache/spark/pull/36479#issuecomment-1126696228 pinging @MaxGekk

[GitHub] [spark] zhengruifeng commented on a diff in pull request #36464: [SPARK-38947][PYTHON] Supports groupby positional indexing

2022-05-14 Thread GitBox
zhengruifeng commented on code in PR #36464: URL: https://github.com/apache/spark/pull/36464#discussion_r872972240 ## python/pyspark/pandas/groupby.py: ## @@ -2110,22 +2110,79 @@ def _limit(self, n: int, asc: bool) -> FrameLike: groupkey_scols =

[GitHub] [spark] weixiuli commented on pull request #36162: [SPARK-32170][CORE] Improve the speculation through the stage task metrics.

2022-05-14 Thread GitBox
weixiuli commented on PR #36162: URL: https://github.com/apache/spark/pull/36162#issuecomment-1126711880 @mridulm @Ngone51 Sorry for the late reply, please help me review if you have time. Thank you very much.

[GitHub] [spark] wangyum opened a new pull request, #36552: [SPARK-38506][SQL] Push partial aggregation through join

2022-05-14 Thread GitBox
wangyum opened a new pull request, #36552: URL: https://github.com/apache/spark/pull/36552 ### What changes were proposed in this pull request? 1. Add a new optimizer rule(PushPartialAggregationThroughJoin) to push the partial aggregation through join. It supports the following

[GitHub] [spark] beliefer commented on a diff in pull request #36541: [SPARK-39180][SQL] Simplify the planning of limit and offset

2022-05-14 Thread GitBox
beliefer commented on code in PR #36541: URL: https://github.com/apache/spark/pull/36541#discussion_r872964095 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala: ## @@ -82,52 +82,45 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan]

[GitHub] [spark] beliefer commented on a diff in pull request #36541: [SPARK-39180][SQL] Simplify the planning of limit and offset

2022-05-14 Thread GitBox
beliefer commented on code in PR #36541: URL: https://github.com/apache/spark/pull/36541#discussion_r872964108 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala: ## @@ -82,52 +82,45 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan]

[GitHub] [spark] HyukjinKwon commented on pull request #36534: [SPARK-39174][SQL] Catalogs loading swallows missing classname for ClassNotFoundException

2022-05-14 Thread GitBox
HyukjinKwon commented on PR #36534: URL: https://github.com/apache/spark/pull/36534#issuecomment-1126689562 Merged to master, branch-3.3, branch-3.2 and branch-3.1.

[GitHub] [spark] HyukjinKwon closed pull request #36534: [SPARK-39174][SQL] Catalogs loading swallows missing classname for ClassNotFoundException

2022-05-14 Thread GitBox
HyukjinKwon closed pull request #36534: [SPARK-39174][SQL] Catalogs loading swallows missing classname for ClassNotFoundException URL: https://github.com/apache/spark/pull/36534

[GitHub] [spark] MaxGekk commented on a diff in pull request #36546: [SPARK-37544][SQL] Correct date arithmetic in sequences

2022-05-14 Thread GitBox
MaxGekk commented on code in PR #36546: URL: https://github.com/apache/spark/pull/36546#discussion_r873062601 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala: ## @@ -964,6 +964,50 @@ class CollectionExpressionsSuite

[GitHub] [spark] AmplabJenkins commented on pull request #36551: [SPARK-38463][CORE] Use error classes in org.apache.spark.input

2022-05-14 Thread GitBox
AmplabJenkins commented on PR #36551: URL: https://github.com/apache/spark/pull/36551#issuecomment-1126795684 Can one of the admins verify this patch?

[GitHub] [spark] wangyum commented on pull request #36552: [SPARK-38506][SQL] Push partial aggregation through join

2022-05-14 Thread GitBox
wangyum commented on PR #36552: URL: https://github.com/apache/spark/pull/36552#issuecomment-1126736623 Part of the TPC-DS q24a query plan. Before this PR | After this PR -- | --

[GitHub] [spark] MaxGekk commented on a diff in pull request #36479: [SPARK-38688][SQL][TESTS] Use error classes in the compilation errors of deserializer

2022-05-14 Thread GitBox
MaxGekk commented on code in PR #36479: URL: https://github.com/apache/spark/pull/36479#discussion_r873053587 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala: ## @@ -147,14 +147,17 @@ object QueryCompilationErrors extends QueryErrorsBase