[GitHub] [spark] zzzzming95 commented on pull request #40688: [SPARK-43021][SQL] `CoalesceBucketsInJoin` not work when using AQE

2023-04-07 Thread via GitHub
zzzzming95 commented on PR #40688: URL: https://github.com/apache/spark/pull/40688#issuecomment-1500796913 The CI build failure doesn't seem to be caused by this patch, can you take a look? @dongjoon-hyun @viirya -- This is an automated message from the Apache Git Service. To

[GitHub] [spark] zzzzming95 commented on pull request #40688: [SPARK-43021][SQL] `CoalesceBucketsInJoin` not work when using AQE

2023-04-07 Thread via GitHub
zzzzming95 commented on PR #40688: URL: https://github.com/apache/spark/pull/40688#issuecomment-1500796714 > Maybe, no? If this is not working properly before, we cannot enable this configuration at Apache Spark 3.5.0. Since we need to wait for one release cycle, we may be able to do that
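
For readers following this thread: `CoalesceBucketsInJoin` rewrites a bucketed join so that the side with more buckets is read using the other side's smaller bucket count, avoiding a shuffle. A rough pure-Python sketch of that decision rule follows; the 4x cap mirrors the default of `spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio`, and the function name and shape are illustrative, not Spark's actual code:

```python
def coalesced_bucket_count(left_buckets: int, right_buckets: int, max_ratio: int = 4):
    """Sketch of the CoalesceBucketsInJoin decision: coalesce the side with
    more buckets down to the smaller count when the larger count is evenly
    divisible by the smaller and the ratio does not exceed max_ratio;
    otherwise leave both sides untouched."""
    small, large = sorted((left_buckets, right_buckets))
    if large % small == 0 and large // small <= max_ratio:
        return small  # both sides can be read with the smaller bucket count
    return None      # no coalescing; a shuffle may still be required
```

For example, joining an 8-bucket table to a 4-bucket table coalesces to 4 buckets, while 8 vs 3 (not divisible) or 64 vs 4 (ratio above the cap) does not coalesce.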

[GitHub] [spark] dongjoon-hyun commented on pull request #40663: [SPARK-39696][CORE] Fix data race in access to TaskMetrics.externalAccums

2023-04-07 Thread via GitHub
dongjoon-hyun commented on PR #40663: URL: https://github.com/apache/spark/pull/40663#issuecomment-1500756781 No problem at all. Thank you always, @LuciferYang !

[GitHub] [spark] HeartSaVioR commented on pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-04-07 Thread via GitHub
HeartSaVioR commented on PR #40561: URL: https://github.com/apache/spark/pull/40561#issuecomment-1500752714 The last update is to rebase with master branch - just to make sure CI is happy with the change before merging this.
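
The `dropDuplicatesWithinWatermark` API discussed throughout this digest deduplicates records whose keys collide within the watermark delay, and per-key state becomes eligible for eviction once the watermark passes it. A conceptual pure-Python sketch of those semantics (illustrative only, not the PySpark implementation; `events` as `(key, event_time)` pairs is an assumption of this sketch):

```python
def dedup_within_watermark(events, delay):
    """Conceptual sketch: events are (key, event_time) pairs in arrival
    order; a record is dropped if the same key was already emitted and its
    state has not yet been evicted, and state is evicted once the watermark
    (max event time seen minus delay) passes the stored event time."""
    last_emitted = {}            # key -> event_time of the record we kept
    watermark = float("-inf")
    out = []
    for key, t in events:
        watermark = max(watermark, t - delay)
        # evict per-key state older than the watermark
        last_emitted = {k: v for k, v in last_emitted.items() if v >= watermark}
        if key not in last_emitted:
            out.append((key, t))
            last_emitted[key] = t
    return out
```

So two records with the same key arriving within the delay collapse to one, while a repeat arriving after the watermark has advanced past the first record is emitted again.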

[GitHub] [spark] github-actions[bot] closed pull request #38896: [WIP][SQL] Replace `require()` by an internal error in catalyst

2023-04-07 Thread via GitHub
github-actions[bot] closed pull request #38896: [WIP][SQL] Replace `require()` by an internal error in catalyst URL: https://github.com/apache/spark/pull/38896

[GitHub] [spark] github-actions[bot] closed pull request #38893: [Spark-40099][SQL] Merge adjacent CaseWhen branches if their values are the same

2023-04-07 Thread via GitHub
github-actions[bot] closed pull request #38893: [Spark-40099][SQL] Merge adjacent CaseWhen branches if their values are the same URL: https://github.com/apache/spark/pull/38893

[GitHub] [spark] github-actions[bot] commented on pull request #39021: [SPARK-41483][CORE] Last metrics system report should have a timeout, avoid to lead shutdown hook timeout

2023-04-07 Thread via GitHub
github-actions[bot] commented on PR #39021: URL: https://github.com/apache/spark/pull/39021#issuecomment-1500737503 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] github-actions[bot] closed pull request #39219: [WIP][SPARK-41277] Auto infer bucketing info for shuffled actions

2023-04-07 Thread via GitHub
github-actions[bot] closed pull request #39219: [WIP][SPARK-41277] Auto infer bucketing info for shuffled actions URL: https://github.com/apache/spark/pull/39219

[GitHub] [spark] github-actions[bot] commented on pull request #39259: [SPARK-41739][SQL] CheckRule should not be executed when analyze view child

2023-04-07 Thread via GitHub
github-actions[bot] commented on PR #39259: URL: https://github.com/apache/spark/pull/39259#issuecomment-1500737477 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39541: [SPARK-42043][CONNECT] Scala Client Result with E2E Tests

2023-04-07 Thread via GitHub
HyukjinKwon commented on code in PR #39541: URL: https://github.com/apache/spark/pull/39541#discussion_r1161032682 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the

[GitHub] [spark] gengliangwang commented on pull request #40711: [SPARK-43072][DOC] Include TIMESTAMP_NTZ type in ANSI Compliance doc

2023-04-07 Thread via GitHub
gengliangwang commented on PR #40711: URL: https://github.com/apache/spark/pull/40711#issuecomment-1500727241 cc @xinrong-meng it would be great to include this in the doc of Spark 3.4.0. (Document changes won't fail RC vote)

[GitHub] [spark] gengliangwang commented on pull request #40711: [SPARK-43072][DOC] Include TIMESTAMP_NTZ type in ANSI Compliance doc

2023-04-07 Thread via GitHub
gengliangwang commented on PR #40711: URL: https://github.com/apache/spark/pull/40711#issuecomment-1500726183 I will come up with screenshots from branch-3.4. The markdown tables in the master branch are not showing properly. cc @grundprinzip

[GitHub] [spark] gengliangwang opened a new pull request, #40711: [SPARK-43072][DOC] Include TIMESTAMP_NTZ type in ANSI Compliance doc

2023-04-07 Thread via GitHub
gengliangwang opened a new pull request, #40711: URL: https://github.com/apache/spark/pull/40711 ### What changes were proposed in this pull request? There are important syntax rules about Cast/Store assignment/Type precedent list in the [ANSI Compliance

[GitHub] [spark] dtenedor commented on a diff in pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

2023-04-07 Thread via GitHub
dtenedor commented on code in PR #40710: URL: https://github.com/apache/spark/pull/40710#discussion_r1161013565 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala: ## @@ -91,6 +90,25 @@ case class ResolveDefaultColumns(catalog:

[GitHub] [spark] dongjoon-hyun closed pull request #40709: [SPARK-43070][BUILD] Upgrade `sbt-unidoc` to 0.5.0

2023-04-07 Thread via GitHub
dongjoon-hyun closed pull request #40709: [SPARK-43070][BUILD] Upgrade `sbt-unidoc` to 0.5.0 URL: https://github.com/apache/spark/pull/40709

[GitHub] [spark] dongjoon-hyun commented on pull request #40709: [SPARK-43070][BUILD] Upgrade `sbt-unidoc` to 0.5.0

2023-04-07 Thread via GitHub
dongjoon-hyun commented on PR #40709: URL: https://github.com/apache/spark/pull/40709#issuecomment-1500709472 Merged to master for Apache Spark 3.5. Thank you, @huaxingao and @amaliujia

[GitHub] [spark] dongjoon-hyun commented on pull request #40709: [SPARK-43070][BUILD] Upgrade `sbt-unidoc` to 0.5.0

2023-04-07 Thread via GitHub
dongjoon-hyun commented on PR #40709: URL: https://github.com/apache/spark/pull/40709#issuecomment-1500709184 Thank you so much!

[GitHub] [spark] dtenedor commented on a diff in pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

2023-04-07 Thread via GitHub
dtenedor commented on code in PR #40710: URL: https://github.com/apache/spark/pull/40710#discussion_r1161011837 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala: ## @@ -91,6 +90,25 @@ case class ResolveDefaultColumns(catalog:

[GitHub] [spark] dongjoon-hyun commented on pull request #40709: [SPARK-43070][BUILD] Upgrade `sbt-unidoc` to 0.5.0

2023-04-07 Thread via GitHub
dongjoon-hyun commented on PR #40709: URL: https://github.com/apache/spark/pull/40709#issuecomment-1500708179 Could you review this PR, @huaxingao ?

[GitHub] [spark] dongjoon-hyun commented on pull request #40709: [SPARK-43070][BUILD] Upgrade `sbt-unidoc` to 0.5.0

2023-04-07 Thread via GitHub
dongjoon-hyun commented on PR #40709: URL: https://github.com/apache/spark/pull/40709#issuecomment-1500708097 Documentation generation GitHub Action job passed. [screenshot attached]

[GitHub] [spark] gengliangwang commented on a diff in pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

2023-04-07 Thread via GitHub
gengliangwang commented on code in PR #40710: URL: https://github.com/apache/spark/pull/40710#discussion_r1161010584 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala: ## @@ -91,6 +90,25 @@ case class

[GitHub] [spark] gengliangwang commented on a diff in pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

2023-04-07 Thread via GitHub
gengliangwang commented on code in PR #40710: URL: https://github.com/apache/spark/pull/40710#discussion_r1161009871 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala: ## @@ -91,6 +90,25 @@ case class

[GitHub] [spark] dtenedor opened a new pull request, #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

2023-04-07 Thread via GitHub
dtenedor opened a new pull request, #40710: URL: https://github.com/apache/spark/pull/40710 ### What changes were proposed in this pull request? This PR extends column default support to allow the ORDER BY, LIMIT, and OFFSET clauses at the end of a SELECT query in the INSERT source

[GitHub] [spark] ueshin commented on a diff in pull request #40692: [SPARK-43055][CONNECT][PYTHON] Support duplicated nested field names

2023-04-07 Thread via GitHub
ueshin commented on code in PR #40692: URL: https://github.com/apache/spark/pull/40692#discussion_r1160985748 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkResult.scala: ## @@ -60,13 +61,19 @@ private[sql] class SparkResult[T](

[GitHub] [spark] zhenlineo closed pull request #40274: [SPARK-42215][CONNECT] Simplify Scala Client IT tests

2023-04-07 Thread via GitHub
zhenlineo closed pull request #40274: [SPARK-42215][CONNECT] Simplify Scala Client IT tests URL: https://github.com/apache/spark/pull/40274

[GitHub] [spark] jiangxb1987 commented on a diff in pull request #40690: [SPARK-43043][CORE] Improve the performance of MapOutputTracker.updateMapOutput

2023-04-07 Thread via GitHub
jiangxb1987 commented on code in PR #40690: URL: https://github.com/apache/spark/pull/40690#discussion_r1160971328 ## core/src/main/scala/org/apache/spark/MapOutputTracker.scala: ## @@ -157,22 +164,29 @@ private class ShuffleStatus(

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-04-07 Thread via GitHub
HeartSaVioR commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1160970094 ## python/pyspark/sql/dataframe.py: ## @@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame": jdf =

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-04-07 Thread via GitHub
HeartSaVioR commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1160967676 ## python/pyspark/sql/dataframe.py: ## @@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame": jdf =

[GitHub] [spark] jiangxb1987 commented on pull request #40690: [SPARK-43043][CORE] Improve the performance of MapOutputTracker.updateMapOutput

2023-04-07 Thread via GitHub
jiangxb1987 commented on PR #40690: URL: https://github.com/apache/spark/pull/40690#issuecomment-1500652566 This happens on a benchmark job generating a large number of very tiny blocks. When the job is finished, the cluster tries to shutdown the idle executors and migrate all the blocks
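
The slowdown described above is characteristic of a per-update linear scan: migrating n blocks then costs O(n^2). One common remedy, shown here as an illustrative sketch rather than the actual SPARK-43043 patch (class and field names are assumptions), is an auxiliary index from map ID to array position so each update is O(1):

```python
class ShuffleStatusSketch:
    """Illustrative sketch (not Spark's actual code): looking up a map
    output by scanning all statuses is O(n) per update, so an auxiliary
    map_id -> index dict makes each location update O(1)."""

    def __init__(self, statuses):
        # statuses: list of (map_id, location) pairs
        self.statuses = list(statuses)
        self.index_by_map_id = {m: i for i, (m, _) in enumerate(self.statuses)}

    def update_map_output(self, map_id, new_location):
        # O(1) dict lookup replaces a full scan of self.statuses
        i = self.index_by_map_id.get(map_id)
        if i is not None:
            self.statuses[i] = (map_id, new_location)
```

With many tiny blocks migrating during executor decommission, as in the benchmark above, this kind of indexing is what turns the update path from quadratic to linear overall.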

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-04-07 Thread via GitHub
HeartSaVioR commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1160964896 ## python/pyspark/sql/dataframe.py: ## @@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame": jdf =

[GitHub] [spark] amaliujia commented on a diff in pull request #40692: [SPARK-43055][CONNECT][PYTHON] Support duplicated nested field names

2023-04-07 Thread via GitHub
amaliujia commented on code in PR #40692: URL: https://github.com/apache/spark/pull/40692#discussion_r1160953776 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkResult.scala: ## @@ -60,13 +61,19 @@ private[sql] class SparkResult[T](

[GitHub] [spark] dongjoon-hyun commented on pull request #40709: [SPARK-43070][BUILD] Upgrade `sbt-unidoc` to 0.5.0

2023-04-07 Thread via GitHub
dongjoon-hyun commented on PR #40709: URL: https://github.com/apache/spark/pull/40709#issuecomment-1500625015 Yes, correctly. Apache Spark 3.2.0+ uses SBT 1.5.0+ via SPARK-34959.

[GitHub] [spark] WweiL commented on pull request #40691: [SPARK-43031] [SS] [Connect] Enable unit test and doctest for streaming

2023-04-07 Thread via GitHub
WweiL commented on PR #40691: URL: https://github.com/apache/spark/pull/40691#issuecomment-1500623778 CC @rangadi @pengzhon-db

[GitHub] [spark] warrenzhu25 commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

2023-04-07 Thread via GitHub
warrenzhu25 commented on code in PR #39280: URL: https://github.com/apache/spark/pull/39280#discussion_r1160949314 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -2242,6 +2242,16 @@ package object config { .checkValue(_ >= 0, "needs to be a

[GitHub] [spark] dongjoon-hyun commented on pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

2023-04-07 Thread via GitHub
dongjoon-hyun commented on PR #39280: URL: https://github.com/apache/spark/pull/39280#issuecomment-1500617216 Gentle ping @Ngone51 once more.

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

2023-04-07 Thread via GitHub
dongjoon-hyun commented on code in PR #39280: URL: https://github.com/apache/spark/pull/39280#discussion_r1160945224 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -2242,6 +2242,16 @@ package object config { .checkValue(_ >= 0, "needs to be

[GitHub] [spark] dongjoon-hyun opened a new pull request, #40709: [SPARK-43070][BUILD] Upgrade sbt-unidoc to 0.5.0

2023-04-07 Thread via GitHub
dongjoon-hyun opened a new pull request, #40709: URL: https://github.com/apache/spark/pull/40709 ### What changes were proposed in this pull request? This PR aims to upgrade `sbt-unidoc` to 0.5.0. ### Why are the changes needed? Since v0.5.0, organization has moved from

[GitHub] [spark] dongjoon-hyun commented on pull request #40708: [SPARK-43069][BUILD] Use `sbt-eclipse` instead of `sbteclipse-plugin`

2023-04-07 Thread via GitHub
dongjoon-hyun commented on PR #40708: URL: https://github.com/apache/spark/pull/40708#issuecomment-1500594750 I tested this manually. Merged to master/3.4/3.3/3.2.

[GitHub] [spark] dongjoon-hyun closed pull request #40708: [SPARK-43069][BUILD] Use `sbt-eclipse` instead of `sbteclipse-plugin`

2023-04-07 Thread via GitHub
dongjoon-hyun closed pull request #40708: [SPARK-43069][BUILD] Use `sbt-eclipse` instead of `sbteclipse-plugin` URL: https://github.com/apache/spark/pull/40708

[GitHub] [spark] dongjoon-hyun commented on pull request #40708: [SPARK-43069][BUILD] Use `sbt-eclipse` instead of `sbteclipse-plugin`

2023-04-07 Thread via GitHub
dongjoon-hyun commented on PR #40708: URL: https://github.com/apache/spark/pull/40708#issuecomment-1500587401 Thank you, @viirya . The description is fixed now.

[GitHub] [spark] viirya commented on pull request #40708: [SPARK-43069][BUILD] Use `sbt-eclipse` instead of `sbteclipse-plugin`

2023-04-07 Thread via GitHub
viirya commented on PR #40708: URL: https://github.com/apache/spark/pull/40708#issuecomment-1500583121 > This PR aims to use set-eclipse instead of sbteclipse-plugin. One typo `set-eclipse` in the description.

[GitHub] [spark] ueshin commented on a diff in pull request #40015: [SPARK-42437][PYTHON][CONNECT] PySpark catalog.cacheTable will allow to specify storage level

2023-04-07 Thread via GitHub
ueshin commented on code in PR #40015: URL: https://github.com/apache/spark/pull/40015#discussion_r1160908837 ## python/pyspark/sql/connect/plan.py: ## @@ -1830,14 +1831,24 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation: class CacheTable(LogicalPlan):

[GitHub] [spark] dongjoon-hyun commented on pull request #40708: [SPARK-43069][BUILD] Use `sbt-eclipse` instead of `sbteclipse-plugin`

2023-04-07 Thread via GitHub
dongjoon-hyun commented on PR #40708: URL: https://github.com/apache/spark/pull/40708#issuecomment-1500561170 Could you review this, @viirya ? Although the build system seems to be recovering now, I want to reduce the chance of failures in the future by switching the repo.

[GitHub] [spark] anishshri-db commented on pull request #40696: [SPARK-43056][SS] RocksDB state store commit should continue b/ground work only if its paused

2023-04-07 Thread via GitHub
anishshri-db commented on PR #40696: URL: https://github.com/apache/spark/pull/40696#issuecomment-150022 @HeartSaVioR - all tests passed. Please merge when you get a chance. Thx

[GitHub] [spark] dongjoon-hyun opened a new pull request, #40708: [SPARK-43069][BUILD] Use `sbt-eclipse` instead of `sbteclipse-plugin`

2023-04-07 Thread via GitHub
dongjoon-hyun opened a new pull request, #40708: URL: https://github.com/apache/spark/pull/40708 ### What changes were proposed in this pull request? This PR aims to use `set-eclipse` instead of `sbteclipse-plugin`. ### Why are the changes needed? Thanks to SPARK-34959,

[GitHub] [spark] amaliujia commented on pull request #40315: [SPARK-42699][CONNECT] SparkConnectServer should make client and AM same exit code

2023-04-07 Thread via GitHub
amaliujia commented on PR #40315: URL: https://github.com/apache/spark/pull/40315#issuecomment-1500531162 LGTM

[GitHub] [spark] amaliujia commented on pull request #40656: [SPARK-43023][CONNECT][TESTS] Add switch catalog testing scenario for `CatalogSuite`

2023-04-07 Thread via GitHub
amaliujia commented on PR #40656: URL: https://github.com/apache/spark/pull/40656#issuecomment-1500518103 late LGTM!

[GitHub] [spark] RyanBerti commented on pull request #40615: [WIP][SPARK-16484][SQL] Add support for Datasketches HllSketch

2023-04-07 Thread via GitHub
RyanBerti commented on PR #40615: URL: https://github.com/apache/spark/pull/40615#issuecomment-1500518140 @dtenedor FYI, I updated the tests and am just missing one for empty input table, and one for merging sparse/dense sketches. Once I get the build to be green, I'm going to remove the

[GitHub] [spark] dongjoon-hyun commented on pull request #40688: [SPARK-43021][SQL] `CoalesceBucketsInJoin` not work when using AQE

2023-04-07 Thread via GitHub
dongjoon-hyun commented on PR #40688: URL: https://github.com/apache/spark/pull/40688#issuecomment-1500503549 Maybe, no? If this is not working properly before, we cannot enable this configuration at Apache Spark 3.5.0. Since we need to wait for one release cycle, we may be able to do that

[GitHub] [spark] clownxc closed pull request #40703: [SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks

2023-04-07 Thread via GitHub
clownxc closed pull request #40703: [SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks URL: https://github.com/apache/spark/pull/40703

[GitHub] [spark] clownxc opened a new pull request, #40707: [SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks

2023-04-07 Thread via GitHub
clownxc opened a new pull request, #40707: URL: https://github.com/apache/spark/pull/40707 ## What changes were proposed in this pull request? This PR update the task retry logic to not retry if the exception has an error class which means a user error. ## Why are the changes
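
The proposal above boils down to a retry predicate: a failure carrying an error class is treated as a deterministic user error and not retried, while other failures retry up to the usual limit. An illustrative sketch of such a predicate (the function name, the `error_class` attribute, and the default limit are assumptions of this sketch, not Spark's actual API):

```python
def should_retry(exc, attempt, max_attempts=4):
    """Sketch of the proposed policy: never retry a task whose failure
    carries an error class (a user error that would fail identically on
    every attempt); otherwise retry until max_attempts is reached."""
    if getattr(exc, "error_class", None) is not None:
        return False
    return attempt < max_attempts
```

The design rationale is that retrying a deterministic user error (e.g. a failed null check) only wastes cluster time before surfacing the same error to the user.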

[GitHub] [spark] zzzzming95 commented on pull request #40688: [SPARK-43021][SQL] `CoalesceBucketsInJoin` not work when using AQE

2023-04-07 Thread via GitHub
zzzzming95 commented on PR #40688: URL: https://github.com/apache/spark/pull/40688#issuecomment-1500479527 One more question: is it time to make the default value of `SQLConf.COALESCE_BUCKETS_IN_JOIN_ENABLED` true?

[GitHub] [spark] zzzzming95 commented on a diff in pull request #40688: [SPARK-43021][SQL] `CoalesceBucketsInJoin` not work when using AQE

2023-04-07 Thread via GitHub
zzzzming95 commented on code in PR #40688: URL: https://github.com/apache/spark/pull/40688#discussion_r1160843998 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/InsertAdaptiveSparkPlan.scala: ## @@ -60,6 +61,7 @@ case class InsertAdaptiveSparkPlan(

[GitHub] [spark] anishshri-db commented on pull request #40696: [SPARK-43056][SS] RocksDB state store commit should continue b/ground work only if its paused

2023-04-07 Thread via GitHub
anishshri-db commented on PR #40696: URL: https://github.com/apache/spark/pull/40696#issuecomment-1500446559 > Could you please rebase so that CI is retriggered? If the new trial fails again, maybe good to post to dev@ and see whether someone encountered this before, and/or someone is

[GitHub] [spark] aokolnychyi commented on pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-04-07 Thread via GitHub
aokolnychyi commented on PR #40308: URL: https://github.com/apache/spark/pull/40308#issuecomment-1500441091 Failures don't seem to be related.

[GitHub] [spark] rangadi closed pull request #40373: [Draft] Streaming Spark Connect POC

2023-04-07 Thread via GitHub
rangadi closed pull request #40373: [Draft] Streaming Spark Connect POC URL: https://github.com/apache/spark/pull/40373

[GitHub] [spark] rangadi commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-04-07 Thread via GitHub
rangadi commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1160800555 ## python/pyspark/sql/dataframe.py: ## @@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame": jdf =

[GitHub] [spark] warrenzhu25 commented on pull request #38852: [SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor

2023-04-07 Thread via GitHub
warrenzhu25 commented on PR #38852: URL: https://github.com/apache/spark/pull/38852#issuecomment-1500375839 @holdenk @dongjoon-hyun @Ngone51 Help take a look?

[GitHub] [spark] warrenzhu25 commented on a diff in pull request #39280: [SPARK-41766][CORE] Handle decommission request sent before executor registration

2023-04-07 Thread via GitHub
warrenzhu25 commented on code in PR #39280: URL: https://github.com/apache/spark/pull/39280#discussion_r1160765918 ## core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala: ## @@ -102,6 +103,15 @@ class

[GitHub] [spark] cloud-fan commented on pull request #40701: [SPARK-43064][SQL] Spark SQL CLI SQL tab should only show once statement once

2023-04-07 Thread via GitHub
cloud-fan commented on PR #40701: URL: https://github.com/apache/spark/pull/40701#issuecomment-1500331931 https://github.com/apache/spark/pull/40437 might be related. We want to remove `hiveResultString` from CLI and only use it in hive compatibility tests.

[GitHub] [spark] itholic opened a new pull request, #40706: [SPARK-43059][CONNECT][PYTHON] Migrate TypeError from DataFrame(Reader|Writer) into error class

2023-04-07 Thread via GitHub
itholic opened a new pull request, #40706: URL: https://github.com/apache/spark/pull/40706 ### What changes were proposed in this pull request? This PR proposes to migrate TypeError from DataFrame(Reader|Writer) into error class ### Why are the changes needed? Improve

[GitHub] [spark] cloud-fan commented on a diff in pull request #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-07 Thread via GitHub
cloud-fan commented on code in PR #40697: URL: https://github.com/apache/spark/pull/40697#discussion_r1160503755 ## sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala: ## @@ -750,37 +750,29 @@ case class WholeStageCodegenExec(child:

[GitHub] [spark] HeartSaVioR commented on pull request #40705: [SPARK-43067][SS] Correct the location of error class resource file in Kafka connector

2023-04-07 Thread via GitHub
HeartSaVioR commented on PR #40705: URL: https://github.com/apache/spark/pull/40705#issuecomment-1500261827 This is introduced from 3.4 hence ideal to land the fix to 3.4, but the possibility to trigger the bug is relatively very low, hence probably not urgent.

[GitHub] [spark] HeartSaVioR opened a new pull request, #40705: [SPARK-43067][SS] Correct the location of error class resource file in Kafka connector

2023-04-07 Thread via GitHub
HeartSaVioR opened a new pull request, #40705: URL: https://github.com/apache/spark/pull/40705 ### What changes were proposed in this pull request? This PR moves the error class resource file in the Kafka connector from test to src, so that error classes work without test artifacts.

[GitHub] [spark] MaxGekk opened a new pull request, #40704: [WIP][SPARK-43038][SQL] Support the CBC mode by `aes_encrypt()`/`aes_decrypt()`

2023-04-07 Thread via GitHub
MaxGekk opened a new pull request, #40704: URL: https://github.com/apache/spark/pull/40704 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

[GitHub] [spark] clownxc opened a new pull request, #40703: [SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks

2023-04-07 Thread via GitHub
clownxc opened a new pull request, #40703: URL: https://github.com/apache/spark/pull/40703 ## What changes were proposed in this pull request? This PR updates the task retry logic to not retry if the exception has an error class, which indicates a user error. ## Why are the changes needed?
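The proposal above distinguishes user errors (exceptions that carry an error class, e.g. an `AssertNotNull` check failing on bad input) from transient failures worth retrying. A minimal sketch of that distinction in plain Python (the exception type and names are illustrative, not Spark's actual classes):

```python
class SparkUserError(Exception):
    """Stand-in for an exception carrying an error class, i.e. a user error."""
    def __init__(self, error_class, message):
        super().__init__(message)
        self.error_class = error_class

def run_with_retries(task, max_attempts=3):
    """Retry transient failures, but fail fast on user errors:
    re-running a failed NULL check can never succeed."""
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return task(attempt)
        except SparkUserError:
            raise                      # user error: retrying cannot help
        except Exception as exc:
            last_exc = exc             # transient failure: try again
    raise last_exc

# A task that fails transiently twice, then succeeds on the third attempt.
calls = []
def flaky(attempt):
    calls.append(attempt)
    if attempt < 2:
        raise IOError("shuffle fetch failed")
    return "ok"

print(run_with_retries(flaky))  # ok
print(calls)                    # [0, 1, 2]
```

A task raising `SparkUserError` would propagate immediately on the first attempt instead of burning through the retry budget.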

[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-04-07 Thread via GitHub
LuciferYang commented on code in PR #40352: URL: https://github.com/apache/spark/pull/40352#discussion_r1160631080 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -1073,6 +1074,91 @@ class SparkConnectPlanner(val

[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-04-07 Thread via GitHub
LuciferYang commented on code in PR #40352: URL: https://github.com/apache/spark/pull/40352#discussion_r1160620933 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala: ## @@ -584,6 +585,86 @@ final class DataFrameStatFunctions

[GitHub] [spark] beliefer commented on pull request #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-07 Thread via GitHub
beliefer commented on PR #40697: URL: https://github.com/apache/spark/pull/40697#issuecomment-1500156450 > @beliefer This is not a performance feature. It's just to avoid people making mistakes referencing extra objects in the closure, which can slow down task serialization and increase
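The closure-capture pitfall described here can be illustrated outside Spark. The sketch below is plain Python with hypothetical names, not Spark's actual evaluator API: a small factory object that copies out only the field it needs serializes far more cheaply than shipping a closure's reference to a large driver-side object.

```python
import pickle

class BigDriverState:
    """Stands in for a driver-side object that is expensive to serialize."""
    def __init__(self):
        self.huge = list(range(100_000))  # large payload
        self.threshold = 10               # the only field the task needs

class FilterEvaluatorFactory:
    """Captures only the scalar it needs, not the whole driver object."""
    def __init__(self, threshold):
        self.threshold = threshold

    def create_evaluator(self):
        t = self.threshold
        return lambda partition: [x for x in partition if x > t]

state = BigDriverState()

# Anti-pattern: the whole object would ship with every task.
careless_payload = pickle.dumps(state)

# Evaluator-factory pattern: only the factory (one int) is shipped.
factory = FilterEvaluatorFactory(state.threshold)
frugal_payload = pickle.dumps(factory)

evaluate = factory.create_evaluator()
print(evaluate([5, 15, 3, 42]))                            # [15, 42]
print(len(careless_payload) > 100 * len(frugal_payload))   # True
```

This is the mistake the API is meant to make hard: a reference accidentally kept in the closure slows task serialization and inflates every task's payload, even though the task logic itself is unchanged.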

[GitHub] [spark] LuciferYang commented on pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-04-07 Thread via GitHub
LuciferYang commented on PR #40352: URL: https://github.com/apache/spark/pull/40352#issuecomment-1500147099 GA failure is not related to the current PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

[GitHub] [spark] LuciferYang commented on pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-04-07 Thread via GitHub
LuciferYang commented on PR #40352: URL: https://github.com/apache/spark/pull/40352#issuecomment-1500146157 In the last commit, made `BloomFilterAggregate` explicitly support `IntegerType/ShortType/ByteType` and added corresponding updaters, then removed pass `dataType` and `adding cast
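The idea of giving the aggregate per-type updaters, rather than inserting a cast of narrower integer types up to long, can be sketched with a toy bloom filter. This is plain Python with hypothetical names; Spark's real `BloomFilterAggregate` operates on serialized sketches, not this structure.

```python
import hashlib

class ToyBloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bitset stored as one big int

    def _positions(self, value: int):
        # Derive k bit positions from stable hashes of the 8-byte value.
        data = value.to_bytes(8, "little", signed=True)
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=bytes([seed]) * 16).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def put_long(self, value: int):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    # Per-type updaters: byte/short/int values hash to the same positions
    # as the equal long value, so callers need no explicit cast.
    put_int = put_long
    put_short = put_long
    put_byte = put_long

    def might_contain(self, value: int) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(value))

bf = ToyBloomFilter()
bf.put_byte(7)
print(bf.might_contain(7))   # True
print(bf.might_contain(999)) # almost certainly False (tiny false-positive rate)
```

Keeping the updaters inside the filter is the same design choice as in the PR: the type dispatch happens once, at the aggregate boundary, instead of as a cast wrapped around every input expression.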

[GitHub] [spark] HeartSaVioR opened a new pull request, #40702: [SPARK-43066][SQL] Add test for dropDuplicates in JavaDatasetSuite

2023-04-07 Thread via GitHub
HeartSaVioR opened a new pull request, #40702: URL: https://github.com/apache/spark/pull/40702 ### What changes were proposed in this pull request? This PR proposes to add a test for dropDuplicates in JavaDatasetSuite. ### Why are the changes needed? The API dropDuplicates

[GitHub] [spark] Yikf commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-04-07 Thread via GitHub
Yikf commented on PR #40437: URL: https://github.com/apache/spark/pull/40437#issuecomment-150019 After code validation, ThriftServerQueryTestSuite and SQLQueryTestSuite depend on golden files; if the golden file follows the format of df.show (the format of df.show depends on the

[GitHub] [spark] yaooqinn commented on pull request #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-07 Thread via GitHub
yaooqinn commented on PR #40697: URL: https://github.com/apache/spark/pull/40697#issuecomment-1500115645 `PartitionEvaluator` looks better to me, altho I don't have a strong opinion either. -- This is an automated message from the Apache Git Service. To respond to the message, please log

[GitHub] [spark] yaooqinn commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-04-07 Thread via GitHub
yaooqinn commented on PR #40437: URL: https://github.com/apache/spark/pull/40437#issuecomment-1500113298 Adjusting `df.show` may need to change the output of `show` first. Some data values do not have a nice string representation yet -- This is an automated message from the Apache Git

[GitHub] [spark] wangyum commented on pull request #40114: [SPARK-42513][SQL] Push down topK through join

2023-04-07 Thread via GitHub
wangyum commented on PR #40114: URL: https://github.com/apache/spark/pull/40114#issuecomment-1500110950

| Date | No. of queries optimized by this patch | No. of total queries |
| -- | -- | -- |
| 2023/4/5 | 62 | 167608 |
| 2023/4/4 | 139 | 203393 |
| 2023/4/3 | 62 | 191147 |
| 2023/4/2 | 14 |

[GitHub] [spark] AngersZhuuuu commented on pull request #40315: [SPARK-42699][CONNECT] SparkConnectServer should make client and AM same exit code

2023-04-07 Thread via GitHub
AngersZhuuuu commented on PR #40315: URL: https://github.com/apache/spark/pull/40315#issuecomment-1500104571 @amaliujia Like current? also ping @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-04-07 Thread via GitHub
LuciferYang commented on code in PR #40352: URL: https://github.com/apache/spark/pull/40352#discussion_r1160566049 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala: ## @@ -78,7 +79,7 @@ case class

[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-04-07 Thread via GitHub
LuciferYang commented on code in PR #40352: URL: https://github.com/apache/spark/pull/40352#discussion_r1160563946 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -1154,6 +1155,91 @@ class SparkConnectPlanner(val

[GitHub] [spark] Hisoka-X commented on a diff in pull request #40632: [SPARK-42298][SQL] Assign name to _LEGACY_ERROR_TEMP_2132

2023-04-07 Thread via GitHub
Hisoka-X commented on code in PR #40632: URL: https://github.com/apache/spark/pull/40632#discussion_r1160562342 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala: ## @@ -1404,8 +1404,8 @@ private[sql] object QueryExecutionErrors extends

[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #40701: [SPARK-43064][SQL] Spark SQL CLI SQL tab should only show once statement once

2023-04-07 Thread via GitHub
AngersZhuuuu commented on code in PR #40701: URL: https://github.com/apache/spark/pull/40701#discussion_r1160561169 ## sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLDriver.scala: ## @@ -65,8 +66,13 @@ private[hive] class SparkSQLDriver(val

[GitHub] [spark] MaxGekk commented on a diff in pull request #39937: [SPARK-42309][SQL] Introduce `INCOMPATIBLE_DATA_TO_TABLE` and sub classes.

2023-04-07 Thread via GitHub
MaxGekk commented on code in PR #39937: URL: https://github.com/apache/spark/pull/39937#discussion_r1160535300 ## sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala: ## @@ -776,38 +808,62 @@ class InsertSuite extends DataSourceTest with SharedSparkSession {

[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-04-07 Thread via GitHub
LuciferYang commented on code in PR #40352: URL: https://github.com/apache/spark/pull/40352#discussion_r1160557436 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -1073,6 +1074,91 @@ class SparkConnectPlanner(val

[GitHub] [spark] LuciferYang commented on a diff in pull request #40605: [SPARK-42958][CONNECT] Refactor `connect-jvm-client-mima-check` to support mima check with avro module

2023-04-07 Thread via GitHub
LuciferYang commented on code in PR #40605: URL: https://github.com/apache/spark/pull/40605#discussion_r1160556535 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala: ## @@ -62,15 +62,29 @@ object

[GitHub] [spark] cloud-fan commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-04-07 Thread via GitHub
cloud-fan commented on PR #40437: URL: https://github.com/apache/spark/pull/40437#issuecomment-1500082745 also cc @AngersZhuuuu -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] cloud-fan commented on a diff in pull request #40701: [SPARK-43064][SQL] Spark SQL CLI SQL tab should only show once statement once

2023-04-07 Thread via GitHub
cloud-fan commented on code in PR #40701: URL: https://github.com/apache/spark/pull/40701#discussion_r1160551428 ## sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLDriver.scala: ## @@ -65,8 +66,13 @@ private[hive] class SparkSQLDriver(val

[GitHub] [spark] cloud-fan commented on a diff in pull request #40701: [SPARK-43064][SQL] Spark SQL CLI SQL tab should only show once statement once

2023-04-07 Thread via GitHub
cloud-fan commented on code in PR #40701: URL: https://github.com/apache/spark/pull/40701#discussion_r1160551276 ## sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLDriver.scala: ## @@ -65,8 +66,13 @@ private[hive] class SparkSQLDriver(val

[GitHub] [spark] AngersZhuuuu commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should also stop SparkContext when exit program in yarn mode and pass exitCode to AM side

2023-04-07 Thread via GitHub
AngersZhuuuu commented on PR #40314: URL: https://github.com/apache/spark/pull/40314#issuecomment-1500073974 ping @dongjoon-hyun @HyukjinKwon @attilapiros @srowen -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

[GitHub] [spark] Yikf commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-04-07 Thread via GitHub
Yikf commented on PR #40437: URL: https://github.com/apache/spark/pull/40437#issuecomment-1500073818 > I'm looking for consistency. `df.show` is what users see, and `hiveResultString` is for golden files. Shouldn't the golden file match what users really see? Why do we test something that

[GitHub] [spark] LuciferYang commented on a diff in pull request #40605: [SPARK-42958][CONNECT] Refactor `connect-jvm-client-mima-check` to support mima check with avro module

2023-04-07 Thread via GitHub
LuciferYang commented on code in PR #40605: URL: https://github.com/apache/spark/pull/40605#discussion_r1160542658 ## dev/connect-jvm-client-mima-check: ## @@ -34,20 +34,18 @@ fi rm -f .connect-mima-check-result -echo "Build sql module, connect-client-jvm module and
