[GitHub] [spark] EnricoMi commented on a diff in pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API

2022-07-28 Thread GitBox
EnricoMi commented on code in PR #37304: URL: https://github.com/apache/spark/pull/37304#discussion_r932044028 ## python/pyspark/sql/dataframe.py: ## @@ -2188,6 +2188,142 @@ def cube(self, *cols: "ColumnOrName") -> "GroupedData": # type: ignore[misc] return

[GitHub] [spark] panbingkun commented on a diff in pull request #36996: [SPARK-34305][SQL] Unify v1 and v2 ALTER TABLE .. SET SERDE tests

2022-07-28 Thread GitBox
panbingkun commented on code in PR #36996: URL: https://github.com/apache/spark/pull/36996#discussion_r932085322 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableSetSerdeSuite.scala: ## @@ -0,0 +1,203 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] physinet opened a new pull request, #37329: [SPARK-39832][PYTHON] Support column arguments in regexp_replace

2022-07-28 Thread GitBox
physinet opened a new pull request, #37329: URL: https://github.com/apache/spark/pull/37329 ### What changes were proposed in this pull request? Support either literal Python strings or Column objects for the pattern and replacement arguments for `regexp_replace`. ### Why are the

[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-28 Thread GitBox
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r932183215 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -410,12 +413,24 @@ object V2ScanRelationPushDown extends

[GitHub] [spark] cloud-fan closed pull request #36918: [SQL][SPARK-39528] Use V2 Filter in SupportsRuntimeFiltering

2022-07-28 Thread GitBox
cloud-fan closed pull request #36918: [SQL][SPARK-39528] Use V2 Filter in SupportsRuntimeFiltering URL: https://github.com/apache/spark/pull/36918 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[GitHub] [spark] peter-toth commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names

2022-07-28 Thread GitBox
peter-toth commented on PR #37319: URL: https://github.com/apache/spark/pull/37319#issuecomment-1197915164 So, I was thinking about adding ``` case _: Union => var first = true plan.mapChildren { child => if (first) { first =

[GitHub] [spark] wayneguow commented on pull request #36775: [SPARK-39389]Filesystem closed should not be considered as corrupt files

2022-07-28 Thread GitBox
wayneguow commented on PR #36775: URL: https://github.com/apache/spark/pull/36775#issuecomment-1198234993 IMO, it would be better if users could configure which exceptions are ignored as corrupt files.

[GitHub] [spark] beliefer commented on pull request #37317: [SPARK-39894][SQL] Combine the similar binary comparison in boolean expression.

2022-07-28 Thread GitBox
beliefer commented on PR #37317: URL: https://github.com/apache/spark/pull/37317#issuecomment-1198064228 ping @MaxGekk @gengliangwang @dongjoon-hyun cc @cloud-fan

[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-28 Thread GitBox
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r932182709 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -545,6 +560,9 @@ case class ScanBuilderHolder( var

[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-28 Thread GitBox
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r932181859 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -545,6 +560,9 @@ case class ScanBuilderHolder( var

[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-28 Thread GitBox
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r932186273 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -811,6 +800,244 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with

[GitHub] [spark] goutam-git commented on a diff in pull request #37065: [SPARK-38699][SQL] Use error classes in the execution errors of dictionary encoding

2022-07-28 Thread GitBox
goutam-git commented on code in PR #37065: URL: https://github.com/apache/spark/pull/37065#discussion_r932196681 ## sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/compressionSchemes.scala: ## @@ -421,7 +421,7 @@ private[columnar] case object

[GitHub] [spark] cloud-fan commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names

2022-07-28 Thread GitBox
cloud-fan commented on PR #37319: URL: https://github.com/apache/spark/pull/37319#issuecomment-1198119033 `Union.output` is a long-standing issue (same for `Join.output`). It reuses the first child's output but apparently `Union` and its first child output different values. We have to

[GitHub] [spark] ulysses-you opened a new pull request, #37330: [SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression

2022-07-28 Thread GitBox
ulysses-you opened a new pull request, #37330: URL: https://github.com/apache/spark/pull/37330 ### What changes were proposed in this pull request? Optimize Global sort to RepartitionByExpression, for example: ``` Sort local Sort local Sort global=>

[GitHub] [spark] LuciferYang commented on a diff in pull request #37293: [SPARK-39872][SQL] Change to use `BytePackerForLong#unpack8Values` with Array input api in `VectorizedDeltaBinaryPackedReader`

2022-07-28 Thread GitBox
LuciferYang commented on code in PR #37293: URL: https://github.com/apache/spark/pull/37293#discussion_r932335919 ## sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaBinaryPackedReader.java: ## @@ -300,7 +300,8 @@ private void

[GitHub] [spark] senthh commented on pull request #35785: [SPARK-38213][STREAMING] Adding KafkaSink Metrics feature

2022-07-28 Thread GitBox
senthh commented on PR #35785: URL: https://github.com/apache/spark/pull/35785#issuecomment-1198036728 @dongjoon-hyun @dgd-contributor @gaborgsomogyi @squito Could you kindly review this PR, please?

[GitHub] [spark] LuciferYang opened a new pull request, #37331: [SPARK-39913][BUILD] Upgrade to Arrow 9.0.0

2022-07-28 Thread GitBox
LuciferYang opened a new pull request, #37331: URL: https://github.com/apache/spark/pull/37331 ### What changes were proposed in this pull request? Testing with Arrow 9.0.0, will update here later ### Why are the changes needed? ### Does this PR introduce _any_

[GitHub] [spark] AngersZhuuuu commented on pull request #37162: [SPARK-38910][YARN] Clean spark staging before unregister

2022-07-28 Thread GitBox
AngersZhuuuu commented on PR #37162: URL: https://github.com/apache/spark/pull/37162#issuecomment-1197950750 ping @dongjoon-hyun The latest GA failure was caused by ``` * DONE (miniUI) ERROR: dependency ‘pkgdown’ is not available for package ‘devtools’ * removing

[GitHub] [spark] panbingkun commented on pull request #37314: [SPARK-39891][BUILD] Bump h2 to 2.1.214

2022-07-28 Thread GitBox
panbingkun commented on PR #37314: URL: https://github.com/apache/spark/pull/37314#issuecomment-1197977785 cc @dongjoon-hyun

[GitHub] [spark] MaxGekk commented on a diff in pull request #36996: [SPARK-34305][SQL] Unify v1 and v2 ALTER TABLE .. SET SERDE tests

2022-07-28 Thread GitBox
MaxGekk commented on code in PR #36996: URL: https://github.com/apache/spark/pull/36996#discussion_r932008623 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableSetSerdeSuite.scala: ## @@ -0,0 +1,203 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] ulysses-you commented on pull request #37275: [SPARK-39835][SQL][3.2] Fix EliminateSorts remove global sort below the local sort

2022-07-28 Thread GitBox
ulysses-you commented on PR #37275: URL: https://github.com/apache/spark/pull/37275#issuecomment-1197941198 cc @cloud-fan ready for branch-3.2

[GitHub] [spark] ulysses-you commented on pull request #37276: [SPARK-39835][SQL][3.1] Fix EliminateSorts remove global sort below the local sort

2022-07-28 Thread GitBox
ulysses-you commented on PR #37276: URL: https://github.com/apache/spark/pull/37276#issuecomment-1197940919 cc @cloud-fan ready for branch-3.1

[GitHub] [spark] huaxingao commented on pull request #36918: [SQL][SPARK-39528] Use V2 Filter in SupportsRuntimeFiltering

2022-07-28 Thread GitBox
huaxingao commented on PR #36918: URL: https://github.com/apache/spark/pull/36918#issuecomment-1198240325 Thanks @cloud-fan @zinking

[GitHub] [spark] cloud-fan commented on a diff in pull request #37287: [WIP] code cleanup for CatalogImpl

2022-07-28 Thread GitBox
cloud-fan commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r932317590 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -110,53 +108,44 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog {

[GitHub] [spark] peter-toth commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names

2022-07-28 Thread GitBox
peter-toth commented on PR #37319: URL: https://github.com/apache/spark/pull/37319#issuecomment-1198030620 I don't think that extra `Alias` does any harm in that test, just the expected needs to be amended. My proposal also fixes the issue of the following: ``` SELECT a, b

[GitHub] [spark] cloud-fan commented on a diff in pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox
cloud-fan commented on code in PR #37327: URL: https://github.com/apache/spark/pull/37327#discussion_r932204473 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -153,19 +153,24 @@ class CSVOptions( * Disabled by default for backwards

[GitHub] [spark] ala commented on a diff in pull request #37228: [SPARK-37980][SQL] Extend METADATA column to support row indexes

2022-07-28 Thread GitBox
ala commented on code in PR #37228: URL: https://github.com/apache/spark/pull/37228#discussion_r932280019 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala: ## @@ -223,8 +216,25 @@ object FileSourceStrategy extends Strategy with

[GitHub] [spark] github-actions[bot] closed pull request #36240: [SPARK-37787][CORE] fix bug, Long running Spark Job throw HDFS_DELEGATE_TOKEN not found in cache Exception

2022-07-28 Thread GitBox
github-actions[bot] closed pull request #36240: [SPARK-37787][CORE] fix bug, Long running Spark Job throw HDFS_DELEGATE_TOKEN not found in cache Exception URL: https://github.com/apache/spark/pull/36240

[GitHub] [spark] RS131419 commented on a diff in pull request #37230: [SPARK-33326][SQL] Fix the problem of writing hive partition table without updating metadata information

2022-07-28 Thread GitBox
RS131419 commented on code in PR #37230: URL: https://github.com/apache/spark/pull/37230#discussion_r932792260 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala: ## @@ -1611,4 +1611,26 @@ class StatisticsSuite extends StatisticsCollectionTestBase with

[GitHub] [spark] cloud-fan commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl

2022-07-28 Thread GitBox
cloud-fan commented on PR #37287: URL: https://github.com/apache/spark/pull/37287#issuecomment-1198793708 > Is the issue that `listTables()` does not respect the current catalog fixed in this PR? I think so, by always passing the fully qualified name to `getTable` in `listTables`. We can add tests later,

[GitHub] [spark] HyukjinKwon commented on pull request #37329: [SPARK-39832][PYTHON] Support column arguments in regexp_replace

2022-07-28 Thread GitBox
HyukjinKwon commented on PR #37329: URL: https://github.com/apache/spark/pull/37329#issuecomment-1198809042 cc @zero323 FYI

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37329: [SPARK-39832][PYTHON] Support column arguments in regexp_replace

2022-07-28 Thread GitBox
HyukjinKwon commented on code in PR #37329: URL: https://github.com/apache/spark/pull/37329#discussion_r932809698 ## python/pyspark/sql/functions.py: ## @@ -3262,7 +3262,19 @@ def regexp_extract(str: "ColumnOrName", pattern: str, idx: int) -> Column: return

[GitHub] [spark] Yikun commented on pull request #37258: [DO-NOT-MERGE] trigger CI

2022-07-28 Thread GitBox
Yikun commented on PR #37258: URL: https://github.com/apache/spark/pull/37258#issuecomment-1198812267 Sorry for the late reply, I've been busy with some local meetings in recent days. > In addition, can we get the content of dmesg? @LuciferYang We can add a separate step like: ``` -

[GitHub] [spark] LuciferYang commented on pull request #37258: [DO-NOT-MERGE] trigger CI

2022-07-28 Thread GitBox
LuciferYang commented on PR #37258: URL: https://github.com/apache/spark/pull/37258#issuecomment-1198825739 > Sorry for the late reply, I've been busy with some local meetings in recent days. > > > In addition, can we get the content of dmesg? > > @LuciferYang We can add a separate step like:

[GitHub] [spark] deshanxiao opened a new pull request, #37336: [SPARK-39916][SQL][MLLIB][REFACTOR] Merge ml SchemaUtils to SQL

2022-07-28 Thread GitBox
deshanxiao opened a new pull request, #37336: URL: https://github.com/apache/spark/pull/37336 ### What changes were proposed in this pull request? Today we have two SchemaUtils: SQL SchemaUtils and mllib SchemaUtils. This PR tries to remove the SchemaUtils in mllib. ### Why are the

[GitHub] [spark] zhengruifeng commented on a diff in pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API

2022-07-28 Thread GitBox
zhengruifeng commented on code in PR #37304: URL: https://github.com/apache/spark/pull/37304#discussion_r932846004 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2127,6 +2127,15 @@ class Dataset[T] private[sql]( valueColumnName: String): DataFrame

[GitHub] [spark] MaxGekk commented on a diff in pull request #37337: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow

2022-07-28 Thread GitBox
MaxGekk commented on code in PR #37337: URL: https://github.com/apache/spark/pull/37337#discussion_r932884678 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalMathUtils.scala: ## @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [spark] ivoson commented on a diff in pull request #37268: [SPARK-39853][CORE] Support stage level task resource schedule for standalone cluster when dynamic allocation disabled

2022-07-28 Thread GitBox
ivoson commented on code in PR #37268: URL: https://github.com/apache/spark/pull/37268#discussion_r928873929 ## core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala: ## @@ -388,14 +388,19 @@ private[spark] class TaskSchedulerImpl( val execId =

[GitHub] [spark] Yikun commented on a diff in pull request #37305: [SPARK-39881][PYTHON] Fix erroneous check for black and reenable black validation.

2022-07-28 Thread GitBox
Yikun commented on code in PR #37305: URL: https://github.com/apache/spark/pull/37305#discussion_r932795358 ## dev/lint-python: ## @@ -210,7 +210,7 @@ function black_test { local BLACK_STATUS= # Skip check if black is not installed. -$BLACK_BUILD 2> /dev/null +

[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-28 Thread GitBox
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r932847111 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -811,6 +800,244 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with

[GitHub] [spark] zhengruifeng commented on a diff in pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API

2022-07-28 Thread GitBox
zhengruifeng commented on code in PR #37304: URL: https://github.com/apache/spark/pull/37304#discussion_r932851912 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2127,6 +2127,15 @@ class Dataset[T] private[sql]( valueColumnName: String): DataFrame

[GitHub] [spark] gengliangwang opened a new pull request, #37337: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow

2022-07-28 Thread GitBox
gengliangwang opened a new pull request, #37337: URL: https://github.com/apache/spark/pull/37337 ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/37313, currently, when arithmetic overflow errors happen under ANSI mode,

[GitHub] [spark] gengliangwang opened a new pull request, #37338: [SPARK-39918][SQL][MINOR] Replace the wording "un-comparable" with "incomparable" in error message

2022-07-28 Thread GitBox
gengliangwang opened a new pull request, #37338: URL: https://github.com/apache/spark/pull/37338 ### What changes were proposed in this pull request? Update the codegen error message for data types that can't be compared by replacing `un-comparable` with `incomparable`

[GitHub] [spark] gengliangwang commented on pull request #37338: [SPARK-39918][SQL][MINOR] Replace the wording "un-comparable" with "incomparable" in error message

2022-07-28 Thread GitBox
gengliangwang commented on PR #37338: URL: https://github.com/apache/spark/pull/37338#issuecomment-1198884914 This is trivial. I found it when working on https://github.com/apache/spark/pull/37337

[GitHub] [spark] Jonathancui123 commented on pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox
Jonathancui123 commented on PR #37327: URL: https://github.com/apache/spark/pull/37327#issuecomment-1198894995 > Should we keep the requirement that `inferDate = true` needs `inferSchema = true`? I think we should clarify semantics. @sadikovi I think we should keep the requirement and

[GitHub] [spark] MaxGekk commented on a diff in pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead

2022-07-28 Thread GitBox
MaxGekk commented on code in PR #37322: URL: https://github.com/apache/spark/pull/37322#discussion_r932499581 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala: ## @@ -305,14 +305,17 @@ class DatasetUnpivotSuite extends QueryTest

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37335: [SPARK-39895][PYTHON] Support multiple column drop

2022-07-28 Thread GitBox
dongjoon-hyun commented on code in PR #37335: URL: https://github.com/apache/spark/pull/37335#discussion_r932701092 ## python/pyspark/sql/dataframe.py: ## @@ -3237,17 +3237,18 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame": # type: ignore[misc] """

[GitHub] [spark] dtenedor commented on pull request #37280: [SPARK-39862][SQL] Fix bugs in existence DEFAULT value lookups for V2 data sources

2022-07-28 Thread GitBox
dtenedor commented on PR #37280: URL: https://github.com/apache/spark/pull/37280#issuecomment-1198675997 @gengliangwang Sure, this is done.

[GitHub] [spark] amaliujia commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl

2022-07-28 Thread GitBox
amaliujia commented on PR #37287: URL: https://github.com/apache/spark/pull/37287#issuecomment-1198716223 Is the issue that `listTables()` does not respect the current catalog fixed in this PR?

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37335: [SPARK-39895][PYTHON] Support multiple column drop

2022-07-28 Thread GitBox
dongjoon-hyun commented on code in PR #37335: URL: https://github.com/apache/spark/pull/37335#discussion_r932774765 ## python/pyspark/sql/tests/test_dataframe.py: ## @@ -87,6 +87,21 @@ def test_help_command(self): pydoc.render_doc(df.foo)

[GitHub] [spark] cfmcgrady commented on a diff in pull request #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep attributes of a Union's first child

2022-07-28 Thread GitBox
cfmcgrady commented on code in PR #37334: URL: https://github.com/apache/spark/pull/37334#discussion_r932804420 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -559,6 +559,17 @@ object RemoveRedundantAliases extends

[GitHub] [spark] Yikun commented on a diff in pull request #37305: [SPARK-39881][PYTHON] Fix erroneous check for black and reenable black validation.

2022-07-28 Thread GitBox
Yikun commented on code in PR #37305: URL: https://github.com/apache/spark/pull/37305#discussion_r932810013 ## python/pyspark/ml/feature.py: ## @@ -968,7 +968,7 @@ class _CountVectorizerParams(JavaParams, HasInputCol, HasOutputCol): def __init__(self, *args: Any):

[GitHub] [spark] Yikun commented on pull request #37328: [SPARK-39907][PS] Implement axis and skipna of Series.argmin

2022-07-28 Thread GitBox
Yikun commented on PR #37328: URL: https://github.com/apache/spark/pull/37328#issuecomment-1198820043 otherwise LGTM! Thanks

[GitHub] [spark] HyukjinKwon commented on pull request #37258: [DO-NOT-MERGE] trigger CI

2022-07-28 Thread GitBox
HyukjinKwon commented on PR #37258: URL: https://github.com/apache/spark/pull/37258#issuecomment-1198843792 Let me close this one. I believe all are fixed now.

[GitHub] [spark] HyukjinKwon closed pull request #37258: [DO-NOT-MERGE] trigger CI

2022-07-28 Thread GitBox
HyukjinKwon closed pull request #37258: [DO-NOT-MERGE] trigger CI URL: https://github.com/apache/spark/pull/37258

[GitHub] [spark] zhengruifeng commented on a diff in pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API

2022-07-28 Thread GitBox
zhengruifeng commented on code in PR #37304: URL: https://github.com/apache/spark/pull/37304#discussion_r932840669 ## python/pyspark/context.py: ## @@ -309,10 +309,7 @@ def _do_init( if sys.version_info[:2] < (3, 8): with warnings.catch_warnings():

[GitHub] [spark] MaxGekk commented on pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead

2022-07-28 Thread GitBox
MaxGekk commented on PR #37322: URL: https://github.com/apache/spark/pull/37322#issuecomment-1198870290 @anchovYu @cloud-fan @HyukjinKwon @gengliangwang Could you review this PR, please?

[GitHub] [spark] amaliujia commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl

2022-07-28 Thread GitBox
amaliujia commented on PR #37287: URL: https://github.com/apache/spark/pull/37287#issuecomment-1198905411 > > Is the issue that `listTables()` does not respect the current catalog fixed in this PR? > > I think so, by always passing the fully qualified name to `getTable` in `listTables`. We can add tests

[GitHub] [spark] gengliangwang closed pull request #37280: [SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups

2022-07-28 Thread GitBox
gengliangwang closed pull request #37280: [SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups URL: https://github.com/apache/spark/pull/37280

[GitHub] [spark] gengliangwang commented on pull request #37280: [SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups

2022-07-28 Thread GitBox
gengliangwang commented on PR #37280: URL: https://github.com/apache/spark/pull/37280#issuecomment-1198710296 Thanks, merging to master

[GitHub] [spark] huaxingao commented on pull request #37332: [SPARK-39914][SQL] Add DS V2 Filter to V1 Filter conversion

2022-07-28 Thread GitBox
huaxingao commented on PR #37332: URL: https://github.com/apache/spark/pull/37332#issuecomment-1198735772 The GA failure doesn't seem relevant.

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37335: [SPARK-39895][PYTHON] Support multiple column drop

2022-07-28 Thread GitBox
HyukjinKwon commented on code in PR #37335: URL: https://github.com/apache/spark/pull/37335#discussion_r932808436 ## python/pyspark/sql/dataframe.py: ## @@ -3244,10 +3244,14 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame": # type: ignore[misc] else:

[GitHub] [spark] HyukjinKwon commented on pull request #37326: [SPARK-39906][INFRA] Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'

2022-07-28 Thread GitBox
HyukjinKwon commented on PR #37326: URL: https://github.com/apache/spark/pull/37326#issuecomment-1198807963 Merged to master.

[GitHub] [spark] deshanxiao commented on pull request #37336: [SPARK-39916][SQL][MLLIB][REFACTOR] Merge ml SchemaUtils to SQL

2022-07-28 Thread GitBox
deshanxiao commented on PR #37336: URL: https://github.com/apache/spark/pull/37336#issuecomment-1198839786 CC @gengliangwang @dongjoon-hyun @cloud-fan

[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-28 Thread GitBox
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r932843706 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -545,6 +560,9 @@ case class ScanBuilderHolder( var

[GitHub] [spark] huaxingao commented on pull request #37332: [SPARK-39914][SQL] Add DS V2 Filter to V1 Filter conversion

2022-07-28 Thread GitBox
huaxingao commented on PR #37332: URL: https://github.com/apache/spark/pull/37332#issuecomment-1198736391 @cloud-fan Could you please take a look when you have time? Thanks!

[GitHub] [spark] HyukjinKwon closed pull request #37326: [SPARK-39906][INFRA] Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'

2022-07-28 Thread GitBox
HyukjinKwon closed pull request #37326: [SPARK-39906][INFRA] Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead' URL: https://github.com/apache/spark/pull/37326

[GitHub] [spark] HyukjinKwon commented on pull request #37328: [SPARK-39907][PS] Implement axis and skipna of Series.argmin

2022-07-28 Thread GitBox
HyukjinKwon commented on PR #37328: URL: https://github.com/apache/spark/pull/37328#issuecomment-1198808336 cc @itholic @xinrong-meng @ueshin FYI

[GitHub] [spark] Yikun commented on pull request #37305: [SPARK-39881][PYTHON] Fix erroneous check for black and reenable black validation.

2022-07-28 Thread GitBox
Yikun commented on PR #37305: URL: https://github.com/apache/spark/pull/37305#issuecomment-1198817543 and CI failed due to `[Run / Scala 2.13 build with SBT](https://github.com/grundprinzip/spark/runs/7546678501?check_suite_focus=true)` git clone networking issue, I think we can pass it by

[GitHub] [spark] sadikovi commented on pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox
sadikovi commented on PR #37327: URL: https://github.com/apache/spark/pull/37327#issuecomment-1198896103 Yes, that was my thinking too. Okay, I will make a few changes to the PR.

[GitHub] [spark] Yikun commented on a diff in pull request #37328: [SPARK-39907][PS] Implement axis and skipna of Series.argmin

2022-07-28 Thread GitBox
Yikun commented on code in PR #37328: URL: https://github.com/apache/spark/pull/37328#discussion_r932814726 ## python/pyspark/pandas/series.py: ## @@ -6322,13 +6322,21 @@ def argmax(self, axis: Axis = None, skipna: bool = True) -> int: # If the maximum is achieved

[GitHub] [spark] ulysses-you commented on pull request #36253: [SPARK-38932][SQL] Datasource v2 support report distinct keys

2022-07-28 Thread GitBox
ulysses-you commented on PR #36253: URL: https://github.com/apache/spark/pull/36253#issuecomment-1198822779 cc @cloud-fan @huaxingao if you have time to take a look, thank you

[GitHub] [spark] c21 commented on a diff in pull request #37290: [SPARK-37194][SQL] Avoid unnecessary sort in v1 write if it's not dynamic partition

2022-07-28 Thread GitBox
c21 commented on code in PR #37290: URL: https://github.com/apache/spark/pull/37290#discussion_r932846383 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala: ## @@ -117,20 +117,26 @@ object V1WritesUtils { outputColumns: Seq[Attribute],

[GitHub] [spark] sadikovi commented on pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox
sadikovi commented on PR #37327: URL: https://github.com/apache/spark/pull/37327#issuecomment-1198856750 Should we keep the requirement that `inferDate = true` needs `inferSchema = true`? I think it is unclear right now.

[GitHub] [spark] c21 commented on a diff in pull request #37264: [SPARK-39849][SQL] Dataset.as(StructType) fills missing new columns with null value

2022-07-28 Thread GitBox
c21 commented on code in PR #37264: URL: https://github.com/apache/spark/pull/37264#discussion_r932857471 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameAsSchemaSuite.scala: ## @@ -46,15 +46,11 @@ class DataFrameAsSchemaSuite extends QueryTest with SharedSparkSession

[GitHub] [spark] c21 commented on pull request #37264: [SPARK-39849][SQL] Dataset.as(StructType) fills missing new columns with null value

2022-07-28 Thread GitBox
c21 commented on PR #37264: URL: https://github.com/apache/spark/pull/37264#issuecomment-1198868034 The PR is ready for review again, thanks @cloud-fan.

[GitHub] [spark] viirya commented on a diff in pull request #37290: [SPARK-37194][SQL] Avoid unnecessary sort in v1 write if it's not dynamic partition

2022-07-28 Thread GitBox
viirya commented on code in PR #37290: URL: https://github.com/apache/spark/pull/37290#discussion_r932864145 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala: ## @@ -107,8 +108,10 @@ object FileFormatWriter extends Logging {

[GitHub] [spark] MaxGekk closed pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead

2022-07-28 Thread GitBox
MaxGekk closed pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead URL: https://github.com/apache/spark/pull/37322

[GitHub] [spark] MaxGekk commented on pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead

2022-07-28 Thread GitBox
MaxGekk commented on PR #37322: URL: https://github.com/apache/spark/pull/37322#issuecomment-1198880973 Merging to master. Thank you, @gengliangwang for review.

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl

2022-07-28 Thread GitBox
dongjoon-hyun commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r932517832 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -33,36 +33,37 @@ import org.apache.spark.storage.StorageLevel abstract class Catalog

[GitHub] [spark] ueshin commented on a diff in pull request #35391: [SPARK-38098][PYTHON] Add support for ArrayType of nested StructType to arrow-based conversion

2022-07-28 Thread GitBox
ueshin commented on code in PR #35391: URL: https://github.com/apache/spark/pull/35391#discussion_r932566706 ## python/pyspark/sql/tests/test_dataframe.py: ## @@ -953,6 +953,30 @@ def test_to_pandas_from_mixed_dataframe(self): pdf_with_only_nulls =

[GitHub] [spark] santosh-d3vpl3x closed pull request #37335: SPARK-39895 pyspark support multiple column drop

2022-07-28 Thread GitBox
santosh-d3vpl3x closed pull request #37335: SPARK-39895 pyspark support multiple column drop URL: https://github.com/apache/spark/pull/37335

[GitHub] [spark] cabral1888 commented on a diff in pull request #37230: [SPARK-33326][SQL] Fix the problem of writing hive partition table without updating metadata information

2022-07-28 Thread GitBox
cabral1888 commented on code in PR #37230: URL: https://github.com/apache/spark/pull/37230#discussion_r932418981 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala: ## @@ -1611,4 +1611,26 @@ class StatisticsSuite extends StatisticsCollectionTestBase

[GitHub] [spark] MaxGekk commented on a diff in pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead

2022-07-28 Thread GitBox
MaxGekk commented on code in PR #37322: URL: https://github.com/apache/spark/pull/37322#discussion_r932495675 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala: ## @@ -305,14 +305,17 @@ class DatasetUnpivotSuite extends QueryTest

[GitHub] [spark] Jonathancui123 commented on a diff in pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox
Jonathancui123 commented on code in PR #37327: URL: https://github.com/apache/spark/pull/37327#discussion_r932486986 ## docs/sql-data-sources-csv.md: ## @@ -109,7 +109,7 @@ Data source options of CSV can be set via: read -inferDate +preferDate false

[GitHub] [spark] MaxGekk commented on a diff in pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead

2022-07-28 Thread GitBox
MaxGekk commented on code in PR #37322: URL: https://github.com/apache/spark/pull/37322#discussion_r932506853 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala: ## @@ -305,14 +305,17 @@ class DatasetUnpivotSuite extends QueryTest

[GitHub] [spark] otterc commented on pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-07-28 Thread GitBox
otterc commented on PR #35906: URL: https://github.com/apache/spark/pull/35906#issuecomment-1198491146 > Should be easy to add. We can have a feature flag, and when initiate the RemoteBlockPushResolver, db can be set to null if this feature flag is turned off, and all the later DB

[GitHub] [spark] sunchao commented on a diff in pull request #36995: [SPARK-39607][SQL][DSV2] Distribution and ordering support V2 function in writing

2022-07-28 Thread GitBox
sunchao commented on code in PR #36995: URL: https://github.com/apache/spark/pull/36995#discussion_r932515356 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala: ## @@ -17,22 +17,33 @@ package

[GitHub] [spark] EnricoMi commented on pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API

2022-07-28 Thread GitBox
EnricoMi commented on PR #37304: URL: https://github.com/apache/spark/pull/37304#issuecomment-1198506801 > btw, you may also need to run `dev/reformat-python` Why do I have to reformat `python/pyspark/context.py`? That seems unrelated.

[GitHub] [spark] santosh-d3vpl3x closed pull request #37333: SPARK-39895 pyspark support multiple column drop

2022-07-28 Thread GitBox
santosh-d3vpl3x closed pull request #37333: SPARK-39895 pyspark support multiple column drop URL: https://github.com/apache/spark/pull/37333

[GitHub] [spark] sadikovi commented on a diff in pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox
sadikovi commented on code in PR #37327: URL: https://github.com/apache/spark/pull/37327#discussion_r932712356 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -153,19 +153,24 @@ class CSVOptions( * Disabled by default for backwards

[GitHub] [spark] gengliangwang closed pull request #37311: [SPARK-39865][SQL][3.3] Show proper error messages on the overflow errors of table insert

2022-07-28 Thread GitBox
gengliangwang closed pull request #37311: [SPARK-39865][SQL][3.3] Show proper error messages on the overflow errors of table insert URL: https://github.com/apache/spark/pull/37311

[GitHub] [spark] peter-toth opened a new pull request, #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep attributes of a Union's first child

2022-07-28 Thread GitBox
peter-toth opened a new pull request, #37334: URL: https://github.com/apache/spark/pull/37334 ### What changes were proposed in this pull request? Keep the output attributes of a `Union` node's first child in the `RemoveRedundantAliases` rule to avoid correctness issues. ### Why

[GitHub] [spark] peter-toth commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names

2022-07-28 Thread GitBox
peter-toth commented on PR #37319: URL: https://github.com/apache/spark/pull/37319#issuecomment-1198525757 I've opened a PR with my proposal here: https://github.com/apache/spark/pull/37334

[GitHub] [spark] santosh-d3vpl3x opened a new pull request, #37335: SPARK-39895 pyspark support multiple column drop

2022-07-28 Thread GitBox
santosh-d3vpl3x opened a new pull request, #37335: URL: https://github.com/apache/spark/pull/37335 * SPARK-39895 pyspark support multiple column drop ### What changes were proposed in this pull request? Fixes issues related to type confirmation in pyspark api ### Why are the

[GitHub] [spark] gengliangwang commented on pull request #37280: [SPARK-39862][SQL] Fix bug in existence DEFAULT value lookups for V2 data sources

2022-07-28 Thread GitBox
gengliangwang commented on PR #37280: URL: https://github.com/apache/spark/pull/37280#issuecomment-1198601705 @dtenedor could you also update the PR description about the ORC fix?

[GitHub] [spark] ulysses-you commented on pull request #37290: [SPARK-37194][SQL] Avoid unnecessary sort in v1 write if it's not dynamic partition

2022-07-28 Thread GitBox
ulysses-you commented on PR #37290: URL: https://github.com/apache/spark/pull/37290#issuecomment-1197941930 cc @viirya @cloud-fan @c21

[GitHub] [spark] cfmcgrady commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names

2022-07-28 Thread GitBox
cfmcgrady commented on PR #37319: URL: https://github.com/apache/spark/pull/37319#issuecomment-1197968855 hi, @peter-toth thank you for your feedback. While these changes of `RemoveRedundantAliases` solve this issue, they break the guarantee of `alias removal should not break after push
