[GitHub] [spark] MaxGekk closed pull request #36426: [SPARK-39085][SQL] Move the error message of `INCONSISTENT_BEHAVIOR_CROSS_VERSION` to error-classes.json

2022-05-02 Thread GitBox
MaxGekk closed pull request #36426: [SPARK-39085][SQL] Move the error message of `INCONSISTENT_BEHAVIOR_CROSS_VERSION` to error-classes.json URL: https://github.com/apache/spark/pull/36426 -- This is an automated message from the Apache Git Service. To respond to the message, please log on

[GitHub] [spark] MaxGekk commented on pull request #36426: [SPARK-39085][SQL] Move the error message of `INCONSISTENT_BEHAVIOR_CROSS_VERSION` to error-classes.json

2022-05-02 Thread GitBox
MaxGekk commented on PR #36426: URL: https://github.com/apache/spark/pull/36426#issuecomment-1115751492 Merging to master. Thank you, @HyukjinKwon for review.

[GitHub] [spark] MaxGekk closed pull request #36428: [SPARK-39087][SQL] Improve messages of error classes

2022-05-02 Thread GitBox
MaxGekk closed pull request #36428: [SPARK-39087][SQL] Improve messages of error classes URL: https://github.com/apache/spark/pull/36428

[GitHub] [spark] MaxGekk commented on pull request #36428: [SPARK-39087][SQL] Improve messages of error classes

2022-05-02 Thread GitBox
MaxGekk commented on PR #36428: URL: https://github.com/apache/spark/pull/36428#issuecomment-1115747192 Merging to master. Thank you, @HyukjinKwon for review.

[GitHub] [spark] sunchao commented on pull request #36427: [SPARK-39086][SQL] Support UDT in Spark Parquet vectorized reader

2022-05-02 Thread GitBox
sunchao commented on PR #36427: URL: https://github.com/apache/spark/pull/36427#issuecomment-1115746980 Looks like `CodeGenerator.getValueFromVector` and `CodeGenerator.getValue` need to be updated, since previously a `ColumnVector` could not be a UDT type, but now it can. If input `dataType` is

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36414: [SPARK-39077][PYTHON] Implement `skipna` of common statistical functions of DataFrame and Series

2022-05-02 Thread GitBox
HyukjinKwon commented on code in PR #36414: URL: https://github.com/apache/spark/pull/36414#discussion_r863330808 ## python/pyspark/pandas/series.py: ## @@ -6859,13 +6860,16 @@ def _reduce_for_stat_function( sfun : the stats function to be used for aggregation

[GitHub] [spark] ravwojdyla commented on a diff in pull request #36430: [WIP][SPARK-38904] Select by schema

2022-05-02 Thread GitBox
ravwojdyla commented on code in PR #36430: URL: https://github.com/apache/spark/pull/36430#discussion_r863329395 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -1593,6 +1593,35 @@ class Dataset[T] private[sql]( @scala.annotation.varargs def

[GitHub] [spark] beliefer commented on pull request #36405: [SPARK-39065][SQL] DS V2 Limit push-down should avoid out of memory

2022-05-02 Thread GitBox
beliefer commented on PR #36405: URL: https://github.com/apache/spark/pull/36405#issuecomment-1115548457 ping @huaxingao cc @cloud-fan

[GitHub] [spark] beliefer commented on a diff in pull request #36405: [SPARK-39065][SQL] DS V2 Limit push-down should avoid out of memory

2022-05-02 Thread GitBox
beliefer commented on code in PR #36405: URL: https://github.com/apache/spark/pull/36405#discussion_r863326711 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala: ## @@ -131,7 +131,8 @@ case class JDBCScanBuilder( }

[GitHub] [spark] beliefer commented on pull request #36417: [SPARK-39057][SQL] Offset could work without Limit

2022-05-02 Thread GitBox
beliefer commented on PR #36417: URL: https://github.com/apache/spark/pull/36417#issuecomment-1115547178 ping @dtenedor cc @cloud-fan

[GitHub] [spark] sadikovi commented on pull request #36427: [SPARK-39086][SQL] Support UDT in Spark Parquet vectorized reader

2022-05-02 Thread GitBox
sadikovi commented on PR #36427: URL: https://github.com/apache/spark/pull/36427#issuecomment-1115533499 Does anyone know why the tests would fail? All of them seem to fail with `[info] Cause: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 114, Column

[GitHub] [spark] HyukjinKwon commented on pull request #36435: [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)

2022-05-02 Thread GitBox
HyukjinKwon commented on PR #36435: URL: https://github.com/apache/spark/pull/36435#issuecomment-1115507707 cc @cloud-fan and @cfmcgrady

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #33436: [SPARK-35912][SQL] Fix nullability of `spark.read.json/spark.read.csv`

2022-05-02 Thread GitBox
HyukjinKwon commented on code in PR #33436: URL: https://github.com/apache/spark/pull/33436#discussion_r863289519 ## docs/sql-migration-guide.md: ## @@ -22,6 +22,10 @@ license: | * Table of contents {:toc} +## Upgrading from Spark SQL 3.2 to 3.3 + + - Since Spark 3.3,

[GitHub] [spark] HyukjinKwon opened a new pull request, #36435: [SPARK-35912][SQL][FOLLOW-UP]Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)

2022-05-02 Thread GitBox
HyukjinKwon opened a new pull request, #36435: URL: https://github.com/apache/spark/pull/36435 ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/33436 that adds a legacy configuration. It's found that it can break

[GitHub] [spark] github-actions[bot] commented on pull request #34934: [SPARK-37675][CORE][SHUFFLE] Return PushMergedRemoteMetaFailedFetchResult if no available push-merged block

2022-05-02 Thread GitBox
github-actions[bot] commented on PR #34934: URL: https://github.com/apache/spark/pull/34934#issuecomment-1115490761 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] github-actions[bot] commented on pull request #35254: [SPARK-37966][SQL] Static partition insert should write _SUCCESS under partition location

2022-05-02 Thread GitBox
github-actions[bot] commented on PR #35254: URL: https://github.com/apache/spark/pull/35254#issuecomment-1115490701 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] panbingkun commented on pull request #36431: [SPARK-38744][SQL][TESTS] Test the error class: NON_LITERAL_PIVOT_VALUES

2022-05-02 Thread GitBox
panbingkun commented on PR #36431: URL: https://github.com/apache/spark/pull/36431#issuecomment-1115486968 > GA test failed > > ``` > 2022-05-02T12:32:09.6882052Z - NON_LITERAL_PIVOT_VALUES: literal expressions required for pivot values *** FAILED *** (33 milliseconds) >

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #33436: [SPARK-35912][SQL] Fix nullability of `spark.read.json/spark.read.csv`

2022-05-02 Thread GitBox
HyukjinKwon commented on code in PR #33436: URL: https://github.com/apache/spark/pull/33436#discussion_r863271189 ## docs/sql-migration-guide.md: ## @@ -22,6 +22,10 @@ license: | * Table of contents {:toc} +## Upgrading from Spark SQL 3.2 to 3.3 + + - Since Spark 3.3,

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36430: [WIP][SPARK-38904] Select by schema

2022-05-02 Thread GitBox
HyukjinKwon commented on code in PR #36430: URL: https://github.com/apache/spark/pull/36430#discussion_r863267242 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -1593,6 +1593,35 @@ class Dataset[T] private[sql]( @scala.annotation.varargs def

[GitHub] [spark] HyukjinKwon commented on pull request #36430: [WIP][SPARK-38904] Select by schema

2022-05-02 Thread GitBox
HyukjinKwon commented on PR #36430: URL: https://github.com/apache/spark/pull/36430#issuecomment-1115478312 @jiangxb1987 who was interested in this.

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36432: [SPARK-39029][PYTHON][TEST]Improve the test coverage for pyspark/broadcast.py

2022-05-02 Thread GitBox
HyukjinKwon commented on code in PR #36432: URL: https://github.com/apache/spark/pull/36432#discussion_r863260459 ## python/pyspark/tests/test_broadcast.py: ## @@ -99,6 +101,30 @@ def test_broadcast_value_against_gc(self): finally: b.destroy() +def

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36432: [SPARK-39029][PYTHON][TEST]Improve the test coverage for pyspark/broadcast.py

2022-05-02 Thread GitBox
HyukjinKwon commented on code in PR #36432: URL: https://github.com/apache/spark/pull/36432#discussion_r863260273 ## python/pyspark/tests/test_broadcast.py: ## @@ -99,6 +101,30 @@ def test_broadcast_value_against_gc(self): finally: b.destroy() +def

[GitHub] [spark] HyukjinKwon closed pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
HyukjinKwon closed pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion URL: https://github.com/apache/spark/pull/36425

[GitHub] [spark] HyukjinKwon commented on pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
HyukjinKwon commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1115464988 Thanks guys!! Merged to master, branch-3.3, branch-3.2, branch-3.1 and branch-3.0.

[GitHub] [spark] sadikovi commented on pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
sadikovi commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1115442507 Thanks, @ankurdave.

[GitHub] [spark] sadikovi commented on a diff in pull request #36427: [SPARK-39086][SQL] Support UDT in Spark Parquet vectorized reader

2022-05-02 Thread GitBox
sadikovi commented on code in PR #36427: URL: https://github.com/apache/spark/pull/36427#discussion_r863240629 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala: ## @@ -208,6 +208,9 @@ object ParquetUtils { case st: StructType

[GitHub] [spark] ankurdave commented on pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
ankurdave commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1115436722 I see, so the use-after-free that caused the crash is occurring in the main task thread? In that case this fix LGTM.

[GitHub] [spark] sadikovi commented on pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
sadikovi commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1115432194 I do have a crash file but I cannot share it unfortunately.

[GitHub] [spark] sadikovi commented on pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
sadikovi commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1115430904 The bug existed because the code was trying to control completion from Python, which is not going to work in most cases.

[GitHub] [spark] xinrong-databricks commented on pull request #36391: [SPARK-39053][PYTHON] Use pandas series index infer in multi-index dtypes

2022-05-02 Thread GitBox
xinrong-databricks commented on PR #36391: URL: https://github.com/apache/spark/pull/36391#issuecomment-1115430786 Sounds good, thanks for the follow-up.

[GitHub] [spark] sadikovi commented on pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
sadikovi commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1115430254 @ankurdave This PR solves just that because the synchronisation is between separate JVM threads instead which works. My test reproduces the issue and I confirm it is fixed with the

[GitHub] [spark] xinrong-databricks commented on pull request #36414: [SPARK-39077][PYTHON] Implement `skipna` of common statistical functions of DataFrame and Series

2022-05-02 Thread GitBox
xinrong-databricks commented on PR #36414: URL: https://github.com/apache/spark/pull/36414#issuecomment-1115428051 CC @zhengruifeng Thanks!

[GitHub] [spark] xinrong-databricks commented on pull request #36414: [SPARK-39077][PYTHON] Implement `skipna` of common statistical functions of DataFrame and Series

2022-05-02 Thread GitBox
xinrong-databricks commented on PR #36414: URL: https://github.com/apache/spark/pull/36414#issuecomment-1115427783 May I get a review when you are free? @ueshin @HyukjinKwon @itholic Thank you!

[GitHub] [spark] HyukjinKwon commented on pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
HyukjinKwon commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1115410128 BTW I'll merge this in a few days to backport to other branches as a minimal fix if there are no objections.

[GitHub] [spark] otterc commented on pull request #36411: [SPARK-39072][SHUFFLE]Fast fail the remaining push blocks if shuffle …

2022-05-02 Thread GitBox
otterc commented on PR #36411: URL: https://github.com/apache/spark/pull/36411#issuecomment-1115410094 @wankunde Can you please provide more details/logs of the problem that you are trying to solve? Specifically, can you provide some logs that exhibit the below > After the shuffle

[GitHub] [spark] otterc commented on pull request #36411: [SPARK-39072][SHUFFLE]Fast fail the remaining push blocks if shuffle …

2022-05-02 Thread GitBox
otterc commented on PR #36411: URL: https://github.com/apache/spark/pull/36411#issuecomment-1115410092 @wankunde Can you please provide more details/logs of the problem that you are trying to solve? Specifically, can you provide some logs that exhibit the below > After the shuffle

[GitHub] [spark] holdenk opened a new pull request, #36434: [SPARK-38969] Fix Decom reporting

2022-05-02 Thread GitBox
holdenk opened a new pull request, #36434: URL: https://github.com/apache/spark/pull/36434 ### What changes were proposed in this pull request? Change how we account for executor loss reasons. ### Why are the changes needed? Race condition in executors which decommission

[GitHub] [spark] abhishekd0907 commented on pull request #35683: [SPARK-30835][SPARK-39018][CORE][YARN] Add support for YARN decommissioning when ESS is disabled

2022-05-02 Thread GitBox
abhishekd0907 commented on PR #35683: URL: https://github.com/apache/spark/pull/35683#issuecomment-1115371673 @attilapiros Can you please review this PR again?

[GitHub] [spark] EnricoMi commented on a diff in pull request #35965: [SPARK-38647][SQL] Add SupportsReportOrdering mix in interface for Scan (DataSourceV2)

2022-05-02 Thread GitBox
EnricoMi commented on code in PR #35965: URL: https://github.com/apache/spark/pull/35965#discussion_r863144390 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala: ## @@ -138,6 +138,12 @@ trait DataSourceV2ScanExecBase extends

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

2022-05-02 Thread GitBox
dongjoon-hyun commented on code in PR #36424: URL: https://github.com/apache/spark/pull/36424#discussion_r863132404 ## core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala: ## @@ -630,41 +630,44 @@ private[history] class FsHistoryProvider(conf: SparkConf,

[GitHub] [spark] holdenk opened a new pull request, #36433: [SPARK-36462][K8S] Add the ability to selectively disable watching or polling

2022-05-02 Thread GitBox
holdenk opened a new pull request, #36433: URL: https://github.com/apache/spark/pull/36433 ### What changes were proposed in this pull request? Add the ability to selectively disable watching or polling Updated version of https://github.com/apache/spark/pull/34264 ###

[GitHub] [spark] ankurdave commented on pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
ankurdave commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1115248094 As @HyukjinKwon had [noted in October 2020](https://github.com/apache/spark/pull/30177/files#r513898999), the `ContextAwareIterator` approach didn't fully solve the problem because the

[GitHub] [spark] AmplabJenkins commented on pull request #36431: [SPARK-38744][SQL][TESTS] Test the error class: NON_LITERAL_PIVOT_VALUES

2022-05-02 Thread GitBox
AmplabJenkins commented on PR #36431: URL: https://github.com/apache/spark/pull/36431#issuecomment-1115228468 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #36432: [SPARK-39029][PYTHON][TEST]Improve the test coverage for pyspark/broadcast.py

2022-05-02 Thread GitBox
AmplabJenkins commented on PR #36432: URL: https://github.com/apache/spark/pull/36432#issuecomment-1115228427 Can one of the admins verify this patch?

[GitHub] [spark] sunchao commented on a diff in pull request #35965: [SPARK-38647][SQL] Add SupportsReportOrdering mix in interface for Scan (DataSourceV2)

2022-05-02 Thread GitBox
sunchao commented on code in PR #35965: URL: https://github.com/apache/spark/pull/35965#discussion_r863083877 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala: ## @@ -138,6 +138,12 @@ trait DataSourceV2ScanExecBase extends

[GitHub] [spark] sunchao commented on a diff in pull request #36427: [SPARK-39086][SQL] Support UDT in Spark Parquet vectorized reader

2022-05-02 Thread GitBox
sunchao commented on code in PR #36427: URL: https://github.com/apache/spark/pull/36427#discussion_r863072081 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala: ## @@ -208,6 +208,9 @@ object ParquetUtils { case st: StructType

[GitHub] [spark] MaxGekk commented on pull request #36426: [SPARK-39085][SQL] Move the error message of `INCONSISTENT_BEHAVIOR_CROSS_VERSION` to error-classes.json

2022-05-02 Thread GitBox
MaxGekk commented on PR #36426: URL: https://github.com/apache/spark/pull/36426#issuecomment-1115182940 @cloud-fan @HyukjinKwon @gengliangwang @srielau Could you review this PR, please.

[GitHub] [spark] MaxGekk commented on pull request #36428: [SPARK-39087][SQL] Improve messages of error classes

2022-05-02 Thread GitBox
MaxGekk commented on PR #36428: URL: https://github.com/apache/spark/pull/36428#issuecomment-1115180520 @cloud-fan @HyukjinKwon @gengliangwang @srielau Could you review this PR, please.

[GitHub] [spark] ueshin commented on pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
ueshin commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1115178840 I remember there were some more fixes after mine. @ankurdave Could you also take a look at this?

[GitHub] [spark] minyyy commented on pull request #36121: [SPARK-38836][SQL] Improve the performance of ExpressionSet

2022-05-02 Thread GitBox
minyyy commented on PR #36121: URL: https://github.com/apache/spark/pull/36121#issuecomment-1115178901 All tests passed. CI was broken when I updated the PR last time. Re-triggered the CI.

[GitHub] [spark] aokolnychyi commented on pull request #36402: [SPARK-38085][SQL][FOLLOWUP] Do not fail too early for DeleteFromTable

2022-05-02 Thread GitBox
aokolnychyi commented on PR #36402: URL: https://github.com/apache/spark/pull/36402#issuecomment-1115141009 LGTM too.

[GitHub] [spark] MyeongKim commented on pull request #33544: [SPARK-34927][INFRA] Support TPCDSQueryBenchmark in Benchmarks

2022-05-02 Thread GitBox
MyeongKim commented on PR #33544: URL: https://github.com/apache/spark/pull/33544#issuecomment-1115128196 > @MyeongKim Any new progress on this? Mind me taking over this issue? Please go ahead!

[GitHub] [spark] xkrogen commented on pull request #34856: [SPARK-37602][CORE] Add config property to set default Spark listeners

2022-05-02 Thread GitBox
xkrogen commented on PR #34856: URL: https://github.com/apache/spark/pull/34856#issuecomment-1115111927 @cloud-fan (or others), do you have any thoughts on what was shared above? From our side we're open to either moving forward with this PR as-is (just adding logic to have a default for

[GitHub] [spark] dtenedor commented on a diff in pull request #36398: [SPARK-38838][SQL] Refactor ResolveDefaultColumns.scala to simplify helper methods

2022-05-02 Thread GitBox
dtenedor commented on code in PR #36398: URL: https://github.com/apache/spark/pull/36398#discussion_r862996565 ## sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala: ## @@ -976,13 +976,6 @@ class InsertSuite extends DataSourceTest with SharedSparkSession {

[GitHub] [spark] dtenedor commented on a diff in pull request #36398: [SPARK-38838][SQL] Refactor ResolveDefaultColumns.scala to simplify helper methods

2022-05-02 Thread GitBox
dtenedor commented on code in PR #36398: URL: https://github.com/apache/spark/pull/36398#discussion_r862990931 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala: ## @@ -50,60 +52,116 @@ import org.apache.spark.sql.types._ case

[GitHub] [spark] srowen commented on pull request #36403: [SPARK-39063][CORE] Remove `finalize()` method and related codes from `LevelDB/RocksDBIterator`

2022-05-02 Thread GitBox
srowen commented on PR #36403: URL: https://github.com/apache/spark/pull/36403#issuecomment-1115056858 SoftReference still allows it to be reclaimed on a full GC. If that helps, I think it's OK to change, as we do not expect many open iterators at any one time. Does that help the issue

[GitHub] [spark] srowen commented on a diff in pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

2022-05-02 Thread GitBox
srowen commented on code in PR #36424: URL: https://github.com/apache/spark/pull/36424#discussion_r862961041 ## core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala: ## @@ -630,41 +630,44 @@ private[history] class FsHistoryProvider(conf: SparkConf,

[GitHub] [spark] pralabhkumar opened a new pull request, #36432: [SPARK-39029][PYTHON][TEST]Improve the test coverage for pyspark/broadcast.py

2022-05-02 Thread GitBox
pralabhkumar opened a new pull request, #36432: URL: https://github.com/apache/spark/pull/36432 ### What changes were proposed in this pull request? This PR adds test cases for broadcast.py ### Why are the changes needed? To cover corner test cases and increase coverage

[GitHub] [spark] gengliangwang commented on a diff in pull request #36398: [SPARK-38838][SQL] Refactor ResolveDefaultColumns.scala to simplify helper methods

2022-05-02 Thread GitBox
gengliangwang commented on code in PR #36398: URL: https://github.com/apache/spark/pull/36398#discussion_r862924482 ## sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala: ## @@ -976,13 +976,6 @@ class InsertSuite extends DataSourceTest with

[GitHub] [spark] gengliangwang commented on a diff in pull request #36398: [SPARK-38838][SQL] Refactor ResolveDefaultColumns.scala to simplify helper methods

2022-05-02 Thread GitBox
gengliangwang commented on code in PR #36398: URL: https://github.com/apache/spark/pull/36398#discussion_r862923822 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala: ## @@ -50,60 +52,116 @@ import org.apache.spark.sql.types._

[GitHub] [spark] LuciferYang commented on pull request #36428: [SPARK-39087][SQL] Improve messages of error classes

2022-05-02 Thread GitBox
LuciferYang commented on PR #36428: URL: https://github.com/apache/spark/pull/36428#issuecomment-1114974312 It seems there are relevant UTs that need to be fixed.

[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

2022-05-02 Thread GitBox
tanvn commented on PR #36424: URL: https://github.com/apache/spark/pull/36424#issuecomment-1114972192 @gengliangwang @srowen @dongjoon-hyun @turboFei @vanzin @HeartSaVioR Please take a look when you have time.

[GitHub] [spark] LuciferYang commented on pull request #36431: [SPARK-38744][SQL][TESTS] Test the error class: NON_LITERAL_PIVOT_VALUES

2022-05-02 Thread GitBox
LuciferYang commented on PR #36431: URL: https://github.com/apache/spark/pull/36431#issuecomment-1114971635 GA test failed ``` 2022-05-02T12:32:09.6882052Z - NON_LITERAL_PIVOT_VALUES: literal expressions required for pivot values *** FAILED *** (33 milliseconds)

[GitHub] [spark] LuciferYang commented on pull request #36403: [SPARK-39063][CORE] Remove `finalize()` method and related codes from `LevelDB/RocksDBIterator`

2022-05-02 Thread GitBox
LuciferYang commented on PR #36403: URL: https://github.com/apache/spark/pull/36403#issuecomment-1114921698 > Changing `WeakReference` in `iteratorTracker` to a strong reference should avoid the issue I mentioned above, all `LevelDB/RocksDBIterator` not explicitly closed in Spark code will be

[GitHub] [spark] LuciferYang commented on pull request #36403: [SPARK-39063][CORE] Remove `finalize()` method and related codes from `LevelDB/RocksDBIterator`

2022-05-02 Thread GitBox
LuciferYang commented on PR #36403: URL: https://github.com/apache/spark/pull/36403#issuecomment-1114905055 Changing `WeakReference` in `iteratorTracker` to a strong reference should avoid the issue I mentioned above, all `LevelDB/RocksDBIterator` not explicitly closed in Spark code will be

[GitHub] [spark] srowen commented on pull request #36403: [SPARK-39063][CORE] Remove `finalize()` method and related codes from `LevelDB/RocksDBIterator`

2022-05-02 Thread GitBox
srowen commented on PR #36403: URL: https://github.com/apache/spark/pull/36403#issuecomment-1114808372 I see, you mean there is an actual problem letting it close on finalize. I agree it's risky in any event to hold a lock in finalize. Well, I guess the question is whether or not we more

[GitHub] [spark] panbingkun commented on pull request #36431: [SPARK-38744][SQL][TESTS] Test the error class: NON_LITERAL_PIVOT_VALUES

2022-05-02 Thread GitBox
panbingkun commented on PR #36431: URL: https://github.com/apache/spark/pull/36431#issuecomment-1114772193 Add tests for the error classes PIVOT_VALUE_DATA_TYPE_MISMATCH to QueryCompilationErrorsSuite. Resolved by: https://issues.apache.org/jira/browse/SPARK-38748

[GitHub] [spark] panbingkun opened a new pull request, #36431: [SPARK-38744][SQL][TESTS] Test the error class: NON_LITERAL_PIVOT_VALUES

2022-05-02 Thread GitBox
panbingkun opened a new pull request, #36431: URL: https://github.com/apache/spark/pull/36431 ## What changes were proposed in this pull request? This PR aims to add a test for the error class NON_LITERAL_PIVOT_VALUES to `QueryCompilationErrorsSuite`. ### Why are the changes

[GitHub] [spark] AmplabJenkins commented on pull request #36429: [SPARK-38733][SQL][TESTS] Test the error class: INCOMPATIBLE_DATASOURCE_REGISTER

2022-05-02 Thread GitBox
AmplabJenkins commented on PR #36429: URL: https://github.com/apache/spark/pull/36429#issuecomment-1114766743 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #36430: [WIP][SPARK-38904] Select by schema

2022-05-02 Thread GitBox
AmplabJenkins commented on PR #36430: URL: https://github.com/apache/spark/pull/36430#issuecomment-1114766715 Can one of the admins verify this patch?

[GitHub] [spark] ravwojdyla commented on pull request #36430: [WIP][SPARK-38904] Select by schema

2022-05-02 Thread GitBox
ravwojdyla commented on PR #36430: URL: https://github.com/apache/spark/pull/36430#issuecomment-1114760687 @HyukjinKwon I'm a little blocked on what's the best way to enforce compatibility of nested columns, do you have any hints please? Also any other comments about the current WIP?

[GitHub] [spark] ravwojdyla opened a new pull request, #36430: [WIP][SPARK-38904] Select by schema

2022-05-02 Thread GitBox
ravwojdyla opened a new pull request, #36430: URL: https://github.com/apache/spark/pull/36430 Almost copy pasting from https://issues.apache.org/jira/browse/SPARK-38904: This PR is related to https://stackoverflow.com/questions/71610435. Let's assume I have a pyspark DataFrame with

[GitHub] [spark] pingsutw commented on pull request #33544: [SPARK-34927][INFRA] Support TPCDSQueryBenchmark in Benchmarks

2022-05-02 Thread GitBox
pingsutw commented on PR #33544: URL: https://github.com/apache/spark/pull/33544#issuecomment-1114745320 @MyeongKim Any new progress on this? Mind me taking over this issue?

[GitHub] [spark] panbingkun opened a new pull request, #36429: [SPARK-38733][SQL][TESTS] Test the error class: INCOMPATIBLE_DATASOURCE_REGISTER

2022-05-02 Thread GitBox
panbingkun opened a new pull request, #36429: URL: https://github.com/apache/spark/pull/36429 ## What changes were proposed in this pull request? This PR aims to add a test for the error class INCOMPATIBLE_DATASOURCE_REGISTER to `QueryExecutionErrorsSuite`. ### Why are the changes

[GitHub] [spark] EnricoMi commented on a diff in pull request #36413: [SPARK-39074][CI] Fail on upload, not download of missing test files

2022-05-02 Thread GitBox
EnricoMi commented on code in PR #36413: URL: https://github.com/apache/spark/pull/36413#discussion_r862705748 ## .github/workflows/build_and_test.yml: ## @@ -280,6 +280,7 @@ jobs: with: name: test-results-${{ matrix.modules }}-${{ matrix.comment }}-${{

[GitHub] [spark] EnricoMi commented on a diff in pull request #35899: [SPARK-38591][SQL] Add flatMapSortedGroups to KeyValueGroupedDataset

2022-05-02 Thread GitBox
EnricoMi commented on code in PR #35899: URL: https://github.com/apache/spark/pull/35899#discussion_r862699830 ## sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala: ## @@ -171,6 +171,86 @@ class KeyValueGroupedDataset[K, V] private[sql](

[GitHub] [spark] HyukjinKwon commented on pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
HyukjinKwon commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1114644298 Will leave it to @ueshin though ..

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36413: [SPARK-39074][CI] Fail on upload, not download of missing test files

2022-05-02 Thread GitBox
HyukjinKwon commented on code in PR #36413: URL: https://github.com/apache/spark/pull/36413#discussion_r862695108 ## .github/workflows/build_and_test.yml: ## @@ -280,6 +280,7 @@ jobs: with: name: test-results-${{ matrix.modules }}-${{ matrix.comment }}-${{

[GitHub] [spark] AmplabJenkins commented on pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
AmplabJenkins commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1114641923 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

2022-05-02 Thread GitBox
AmplabJenkins commented on PR #36424: URL: https://github.com/apache/spark/pull/36424#issuecomment-1114641958 Can one of the admins verify this patch?

[GitHub] [spark] EnricoMi commented on a diff in pull request #35965: [SPARK-38647][SQL] Add SupportsReportOrdering mix in interface for Scan (DataSourceV2)

2022-05-02 Thread GitBox
EnricoMi commented on code in PR #35965: URL: https://github.com/apache/spark/pull/35965#discussion_r862688879 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala: ## @@ -138,6 +138,12 @@ trait DataSourceV2ScanExecBase extends

[GitHub] [spark] HyukjinKwon closed pull request #36423: [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS

2022-05-02 Thread GitBox
HyukjinKwon closed pull request #36423: [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS URL: https://github.com/apache/spark/pull/36423

[GitHub] [spark] HyukjinKwon commented on pull request #36423: [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS

2022-05-02 Thread GitBox
HyukjinKwon commented on PR #36423: URL: https://github.com/apache/spark/pull/36423#issuecomment-1114632255 Merged to master and branch-3.3.

[GitHub] [spark] MaxGekk opened a new pull request, #36428: [SPARK-39087][SQL] Improve messages of error classes

2022-05-02 Thread GitBox
MaxGekk opened a new pull request, #36428: URL: https://github.com/apache/spark/pull/36428 ### What changes were proposed in this pull request? In the PR, I propose to modify error messages of the following error classes: - INVALID_JSON_SCHEMA_MAP_TYPE - INCOMPARABLE_PIVOT_COLUMN

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #33525: [SPARK-35320][SQL] Improve error message for unsupported key types in MapType in from_json expression

2022-05-02 Thread GitBox
HyukjinKwon commented on code in PR #33525: URL: https://github.com/apache/spark/pull/33525#discussion_r862668632 ## sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala: ## @@ -390,11 +390,15 @@ class JsonFunctionsSuite extends QueryTest with

[GitHub] [spark] EnricoMi commented on a diff in pull request #35965: [SPARK-38647][SQL] Add SupportsReportOrdering mix in interface for Scan (DataSourceV2)

2022-05-02 Thread GitBox
EnricoMi commented on code in PR #35965: URL: https://github.com/apache/spark/pull/35965#discussion_r862668354 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsReportOrdering.java: ## @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] EnricoMi commented on a diff in pull request #36413: [SPARK-39074][CI] Fail on upload, not download of missing test files

2022-05-02 Thread GitBox
EnricoMi commented on code in PR #36413: URL: https://github.com/apache/spark/pull/36413#discussion_r862659269 ## .github/workflows/build_and_test.yml: ## @@ -280,6 +280,7 @@ jobs: with: name: test-results-${{ matrix.modules }}-${{ matrix.comment }}-${{

[GitHub] [spark] itholic commented on a diff in pull request #33525: [SPARK-35320][SQL] Improve error message for unsupported key types in MapType in from_json expression

2022-05-02 Thread GitBox
itholic commented on code in PR #33525: URL: https://github.com/apache/spark/pull/33525#discussion_r862656022 ## sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala: ## @@ -390,11 +390,15 @@ class JsonFunctionsSuite extends QueryTest with SharedSparkSession {

[GitHub] [spark] AmplabJenkins commented on pull request #36427: [SPARK-39086] Support UDT in Spark Parquet vectorized reader

2022-05-02 Thread GitBox
AmplabJenkins commented on PR #36427: URL: https://github.com/apache/spark/pull/36427#issuecomment-1114581370 Can one of the admins verify this patch?

[GitHub] [spark] sadikovi commented on pull request #36427: [SPARK-39086] Support UDT in Spark Parquet vectorized reader

2022-05-02 Thread GitBox
sadikovi commented on PR #36427: URL: https://github.com/apache/spark/pull/36427#issuecomment-1114574931 @sunchao Can you review the PR? Thanks.

[GitHub] [spark] senthh commented on pull request #35785: [SPARK-38213][STREAMING] Adding KafkaSink Metrics feature

2022-05-02 Thread GitBox
senthh commented on PR #35785: URL: https://github.com/apache/spark/pull/35785#issuecomment-1114574921 @dongjoon-hyun @dgd-contributor @gaborgsomogyi @squito Could you kindly review this PR, please?

[GitHub] [spark] sadikovi opened a new pull request, #36427: [SPARK-39086] Parquet UDT support

2022-05-02 Thread GitBox
sadikovi opened a new pull request, #36427: URL: https://github.com/apache/spark/pull/36427 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

[GitHub] [spark] LuciferYang closed pull request #36403: [SPARK-39063][CORE] Remove `finalize()` method and related codes from `LevelDB/RocksDBIterator`

2022-05-02 Thread GitBox
LuciferYang closed pull request #36403: [SPARK-39063][CORE] Remove `finalize()` method and related codes from `LevelDB/RocksDBIterator` URL: https://github.com/apache/spark/pull/36403

[GitHub] [spark] MaxGekk opened a new pull request, #36426: [SPARK-39085][SQL] Move error message of `INCONSISTENT_BEHAVIOR_CROSS_VERSION` to error-classes.json

2022-05-02 Thread GitBox
MaxGekk opened a new pull request, #36426: URL: https://github.com/apache/spark/pull/36426 ### What changes were proposed in this pull request? In the PR, I propose to create two new sub-classes of the error class `INCONSISTENT_BEHAVIOR_CROSS_VERSION`: - READ_ANCIENT_DATETIME -

[GitHub] [spark] sadikovi commented on pull request #36425: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

2022-05-02 Thread GitBox
sadikovi commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1114523613 @HyukjinKwon and @ueshin Can you review the PR? Thanks.