[GitHub] [spark] bozhang2820 opened a new pull request, #37931: [WIP][SPARK-40488] Do not wrap exceptions thrown in FileFormatWriter.write with SparkException

2022-09-19 Thread GitBox
bozhang2820 opened a new pull request, #37931: URL: https://github.com/apache/spark/pull/37931 ### What changes were proposed in this pull request? Exceptions thrown in `FileFormatWriter.write` are wrapped with `SparkException("Job aborted.")`, which provides little extra
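The wrapping problem described in this PR can be illustrated outside Spark. The following is a minimal plain-Python sketch, not Spark code: `JobAbortedError`, `failing_write`, and the error message are hypothetical stand-ins chosen for illustration. It only shows why a generic wrapper hides the useful detail from the caller.

```python
class JobAbortedError(Exception):
    """Hypothetical stand-in for a generic wrapper like SparkException("Job aborted.")."""


def failing_write():
    # Hypothetical failure; the message carries the detail a user needs.
    raise ValueError("cannot cast 'abc' to INT")


def write_wrapped():
    # Wrapping style: the caller sees the generic error first and must
    # walk the cause chain to find the real problem.
    try:
        failing_write()
    except Exception as exc:
        raise JobAbortedError("Job aborted.") from exc


def write_unwrapped():
    # Unwrapped style: the original exception propagates unchanged.
    failing_write()


def describe(fn):
    """Return (exception type name, message) for whatever fn raises, else None."""
    try:
        fn()
    except Exception as exc:
        return type(exc).__name__, str(exc)
    return None


wrapped = describe(write_wrapped)
unwrapped = describe(write_unwrapped)
```

In the wrapped case the top-level error says only "Job aborted."; in the unwrapped case the caller sees the original type and message directly.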

[GitHub] [spark] otterc commented on a diff in pull request #37533: [SPARK-40096]Fix finalize shuffle stage slow due to connection creation slow

2022-09-19 Thread GitBox
otterc commented on code in PR #37533: URL: https://github.com/apache/spark/pull/37533#discussion_r974396707 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -2309,7 +2309,18 @@ package object config { " shuffle is enabled.")

[GitHub] [spark] zhengruifeng opened a new pull request, #37929: [SPARK-40486][PS] Implement `spearman` and `kendall` in `DataFrame.corrwith`

2022-09-19 Thread GitBox
zhengruifeng opened a new pull request, #37929: URL: https://github.com/apache/spark/pull/37929 ### What changes were proposed in this pull request? 1. extract the computation of `DataFrame.corr` into `correlation.py`, so it can be reused in `DataFrame.corrwith`/`Groupby.corr`/etc;
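For readers unfamiliar with the `spearman` method named in the PR title: Spearman correlation is the Pearson correlation of the ranks of the two series. The sketch below is a plain-Python illustration of that definition only; the helper names are hypothetical and it is not the code in `correlation.py`.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because only ranks matter, any strictly increasing relationship (for example `y = x ** 2` on positive inputs) gives a coefficient of 1.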

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #37879: [SPARK-40425][SQL] DROP TABLE does not need to do table lookup

2022-09-19 Thread GitBox
ryan-johnson-databricks commented on code in PR #37879: URL: https://github.com/apache/spark/pull/37879#discussion_r974198718 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala: ## @@ -28,8 +28,14 @@ class ResolveCatalogs(val

[GitHub] [spark] cloud-fan commented on pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database

2022-09-19 Thread GitBox
cloud-fan commented on PR #37679: URL: https://github.com/apache/spark/pull/37679#issuecomment-1250982553 SGTM! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

[GitHub] [spark] tgravescs commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
tgravescs commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r974308904 ## connect/src/main/scala/org/apache/spark/sql/sparkconnect/command/SparkConnectCommandPlanner.scala: ## @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] HyukjinKwon closed pull request #37920: [SPARK-40413][SQL] Fix `Column.isin` return null

2022-09-19 Thread GitBox
HyukjinKwon closed pull request #37920: [SPARK-40413][SQL] Fix `Column.isin` return null URL: https://github.com/apache/spark/pull/37920

[GitHub] [spark] itholic commented on a diff in pull request #37873: [SPARK-40419][SQL][TESTS] Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-19 Thread GitBox
itholic commented on code in PR #37873: URL: https://github.com/apache/spark/pull/37873#discussion_r974111272 ## sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala: ## @@ -113,10 +113,10 @@ import org.apache.spark.util.Utils * - Scala UDF test case with a

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37918: [SPARK-40476][ML][SQL] Reduce the shuffle size of ALS

2022-09-19 Thread GitBox
WeichenXu123 commented on code in PR #37918: URL: https://github.com/apache/spark/pull/37918#discussion_r974197580 ## mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala: ## @@ -496,18 +499,23 @@ class ALSModel private[ml] ( .iterator.map { j =>

[GitHub] [spark] srowen commented on pull request #37743: [SPARK-40294][SQL] Fix repeat calls to `PartitionReader.hasNext` timing out

2022-09-19 Thread GitBox
srowen commented on PR #37743: URL: https://github.com/apache/spark/pull/37743#issuecomment-1250963427 Closed in favor of https://github.com/apache/spark/pull/37900

[GitHub] [spark] srowen closed pull request #37743: [SPARK-40294][SQL] Fix repeat calls to `PartitionReader.hasNext` timing out

2022-09-19 Thread GitBox
srowen closed pull request #37743: [SPARK-40294][SQL] Fix repeat calls to `PartitionReader.hasNext` timing out URL: https://github.com/apache/spark/pull/37743

[GitHub] [spark] cloud-fan commented on pull request #37900: [SPARK-40456][SQL] PartitionIterator.hasNext should be cheap to call repeatedly

2022-09-19 Thread GitBox
cloud-fan commented on PR #37900: URL: https://github.com/apache/spark/pull/37900#issuecomment-1250979410 Since https://github.com/apache/spark/pull/37743 is inactive, I'll merge this PR but assign the JIRA ticket to that PR author to share credits.

[GitHub] [spark] cloud-fan commented on pull request #37900: [SPARK-40456][SQL] PartitionIterator.hasNext should be cheap to call repeatedly

2022-09-19 Thread GitBox
cloud-fan commented on PR #37900: URL: https://github.com/apache/spark/pull/37900#issuecomment-1250979859 thanks for review, merging to master!

[GitHub] [spark] cloud-fan closed pull request #37900: [SPARK-40456][SQL] PartitionIterator.hasNext should be cheap to call repeatedly

2022-09-19 Thread GitBox
cloud-fan closed pull request #37900: [SPARK-40456][SQL] PartitionIterator.hasNext should be cheap to call repeatedly URL: https://github.com/apache/spark/pull/37900
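The idea in the PR title, that `hasNext` should be cheap to call repeatedly, is commonly achieved by caching the result of the lookahead. The following plain-Python sketch shows that general technique under stated assumptions: the class name, the `probes` counter, and the method names are hypothetical, and this is not the change merged in the PR.

```python
class CachedLookaheadIterator:
    """Wraps an iterator so has_next() probes the source at most once per element."""

    _UNSET = object()  # sentinel: no element is currently buffered

    def __init__(self, source):
        self._source = iter(source)
        self._buffered = self._UNSET
        self._exhausted = False
        self.probes = 0  # counts the (potentially expensive) reads of the source

    def has_next(self):
        if self._exhausted:
            return False
        if self._buffered is self._UNSET:
            # Only here do we touch the underlying source.
            self.probes += 1
            try:
                self._buffered = next(self._source)
            except StopIteration:
                self._exhausted = True
                return False
        return True

    def next(self):
        if not self.has_next():
            raise StopIteration
        value, self._buffered = self._buffered, self._UNSET
        return value


it = CachedLookaheadIterator([1, 2])
repeated = [it.has_next(), it.has_next(), it.has_next()]
probes_after_repeats = it.probes
items = [it.next(), it.next()]
at_end = [it.has_next(), it.has_next()]
total_probes = it.probes
```

Repeated calls to `has_next()` without an intervening `next()` reuse the buffered element, and once the source is exhausted the answer is remembered.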

[GitHub] [spark] MaxGekk commented on pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes

2022-09-19 Thread GitBox
MaxGekk commented on PR #37916: URL: https://github.com/apache/spark/pull/37916#issuecomment-1251016904 > Can we update core/src/main/resources/error/README.md to mention this special error class naming prefix? Let me do that in the next PR.

[GitHub] [spark] HyukjinKwon commented on pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
HyukjinKwon commented on PR #37710: URL: https://github.com/apache/spark/pull/37710#issuecomment-1251149875 For clarification, I am fine with reverting the whole component if the plan isn't followed in the future.

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974072051 ## python/pyspark/sql/pandas/_typing/__init__.pyi: ## @@ -256,6 +258,10 @@ PandasGroupedMapFunction = Union[ Callable[[Any, DataFrameLike], DataFrameLike], ]

[GitHub] [spark] roczei commented on pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database

2022-09-19 Thread GitBox
roczei commented on PR #37679: URL: https://github.com/apache/spark/pull/37679#issuecomment-1250958346 > Is this a common behavior in other databases? @cloud-fan Good question. The reason that we cannot delete the user-specified default database is that we have the following if

[GitHub] [spark] cloud-fan commented on pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes

2022-09-19 Thread GitBox
cloud-fan commented on PR #37916: URL: https://github.com/apache/spark/pull/37916#issuecomment-1250975885 Can we update `core/src/main/resources/error/README.md` to mention this special error class naming prefix?

[GitHub] [spark] WeichenXu123 commented on pull request #37855: [SPARK-40407][SQL] Fix the potential data skew caused by df.repartition

2022-09-19 Thread GitBox
WeichenXu123 commented on PR #37855: URL: https://github.com/apache/spark/pull/37855#issuecomment-1250998854 @wbo4958 Issue: The xgboost code uses rdd barrier mode, but barrier mode does not work with `coalesce` operator.

[GitHub] [spark] zzzzming95 commented on pull request #37920: [SPARK-40413][SQL] Fix `Column.isin` return null

2022-09-19 Thread GitBox
zzzzming95 commented on PR #37920: URL: https://github.com/apache/spark/pull/37920#issuecomment-1250998408 > `null` comparison should return `null` which I believe is the standard behaviour from ANSI. I tested it in hive and mysql respectively, and it does return null. The pr will
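The behaviour discussed in this comment follows from SQL three-valued logic: `x IN (a, b, ...)` is true if any comparison is true, NULL if none is true but at least one comparison involves NULL, and false otherwise. The sketch below models that rule in plain Python with `None` standing in for NULL. The function name `sql_in` is hypothetical, and the sketch assumes a non-empty candidate list because engines differ on the empty case.

```python
def sql_in(value, candidates):
    """Three-valued IN over a non-empty list: returns True, False, or None (SQL NULL)."""
    if not candidates:
        raise ValueError("this sketch assumes a non-empty candidate list")
    saw_null = False
    for candidate in candidates:
        if value is None or candidate is None:
            # Comparing with NULL yields NULL, which is neither a match nor a mismatch.
            saw_null = True
        elif value == candidate:
            return True
    return None if saw_null else False


match_despite_null = sql_in(1, [1, None])
no_match_with_null = sql_in(2, [1, None])
null_value = sql_in(None, [1, 2])
plain_mismatch = sql_in(2, [1, 3])
```

A definite match wins even when the list contains NULL; a NULL on either side only matters when no match is found.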

[GitHub] [spark] HyukjinKwon commented on pull request #37873: [SPARK-40419][SQL][TESTS] Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-19 Thread GitBox
HyukjinKwon commented on PR #37873: URL: https://github.com/apache/spark/pull/37873#issuecomment-1250864061 Merged to master.

[GitHub] [spark] HyukjinKwon closed pull request #37873: [SPARK-40419][SQL][TESTS] Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-19 Thread GitBox
HyukjinKwon closed pull request #37873: [SPARK-40419][SQL][TESTS] Integrate Grouped Aggregate Pandas UDFs into *.sql test cases URL: https://github.com/apache/spark/pull/37873

[GitHub] [spark] cloud-fan commented on a diff in pull request #37879: [SPARK-40425][SQL] DROP TABLE does not need to do table lookup

2022-09-19 Thread GitBox
cloud-fan commented on code in PR #37879: URL: https://github.com/apache/spark/pull/37879#discussion_r974211151 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala: ## @@ -685,15 +685,15 @@ class DDLParserSuite extends AnalysisTest {

[GitHub] [spark] cloud-fan commented on a diff in pull request #37879: [SPARK-40425][SQL] DROP TABLE does not need to do table lookup

2022-09-19 Thread GitBox
cloud-fan commented on code in PR #37879: URL: https://github.com/apache/spark/pull/37879#discussion_r974211738 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala: ## @@ -28,8 +28,14 @@ class ResolveCatalogs(val catalogManager:

[GitHub] [spark] clementguillot commented on pull request #33154: [SPARK-35949][CORE]Add `keep-spark-context-alive` arg for to prevent closing spark context after invoking main for some case

2022-09-19 Thread GitBox
clementguillot commented on PR #33154: URL: https://github.com/apache/spark/pull/33154#issuecomment-1251015439 Hello @sunpe, thank you for your very fast answer. Please let me give you some more context, I am using Spark v3.3.0 in K8s using [Spark on K8S operator](

[GitHub] [spark] MaxGekk commented on a diff in pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes

2022-09-19 Thread GitBox
MaxGekk commented on code in PR #37916: URL: https://github.com/apache/spark/pull/37916#discussion_r974242158 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -2360,7 +2360,10 @@ class AstBuilder extends

[GitHub] [spark] HyukjinKwon commented on pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
HyukjinKwon commented on PR #37710: URL: https://github.com/apache/spark/pull/37710#issuecomment-1251115196 Thanks for your feedback. Yes, it's pretty much decoupled, and I believe this doesn't affect anything to other components. Sure, I will leave it out for more days.

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37918: [SPARK-40476][ML][SQL] Reduce the shuffle size of ALS

2022-09-19 Thread GitBox
WeichenXu123 commented on code in PR #37918: URL: https://github.com/apache/spark/pull/37918#discussion_r974195241 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala: ## @@ -194,3 +194,44 @@ case class CollectSet( override

[GitHub] [spark] cloud-fan commented on a diff in pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes

2022-09-19 Thread GitBox
cloud-fan commented on code in PR #37916: URL: https://github.com/apache/spark/pull/37916#discussion_r974207860 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -2360,7 +2360,10 @@ class AstBuilder extends

[GitHub] [spark] tgravescs commented on pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
tgravescs commented on PR #37710: URL: https://github.com/apache/spark/pull/37710#issuecomment-1251105175 I would be ok with merging a minimal working version as long as it doesn't impact many other components and destabilize the builds and other developers' activities. If it doesn't fit

[GitHub] [spark] thomasg19930417 commented on pull request #34464: [SPARK-37193][SQL] DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins

2022-09-19 Thread GitBox
thomasg19930417 commented on PR #34464: URL: https://github.com/apache/spark/pull/34464#issuecomment-1250835784 @ekoifman hi, when LOJ and LHS has many empty partition ,why inner join not demote broadcast else if (manyEmptyInOther && canBroadcastPlan) {

[GitHub] [spark] xclyfe opened a new pull request, #37930: [SPARK-40487][SQL] Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel

2022-09-19 Thread GitBox
xclyfe opened a new pull request, #37930: URL: https://github.com/apache/spark/pull/37930 ### What changes were proposed in this pull request? Currently, the defaultJoin method in BroadcastNestedLoopJoinExec collects notMatchedBroadcastRows first, then collects matchedStreamRows.

[GitHub] [spark] HyukjinKwon commented on pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
HyukjinKwon commented on PR #37710: URL: https://github.com/apache/spark/pull/37710#issuecomment-1251143797 There is a testing plan ([Spark Connect API Testing Plan](https://docs.google.com/document/d/1n6EgS5vcmbwJUs5KGX4PzjKZVcSKd0qf0gLNZ6NFvOE/edit?usp=sharing)) that I and @amaliujia

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37873: [SPARK-40419][SQL][TESTS] Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-19 Thread GitBox
HyukjinKwon commented on code in PR #37873: URL: https://github.com/apache/spark/pull/37873#discussion_r974107717 ## sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala: ## @@ -113,10 +113,10 @@ import org.apache.spark.util.Utils * - Scala UDF test case with

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
dongjoon-hyun commented on code in PR #37924: URL: https://github.com/apache/spark/pull/37924#discussion_r974486551 ## docs/configuration.md: ## @@ -2605,6 +2605,15 @@ Apart from these, the following properties are also available, and may be useful 2.2.0 + +

[GitHub] [spark] mridulm commented on pull request #37922: [WIP][SPARK-40480][SHUFFLE] Remove push-based shuffle data after query finished

2022-09-19 Thread GitBox
mridulm commented on PR #37922: URL: https://github.com/apache/spark/pull/37922#issuecomment-1251349810 > The push-based shuffle service will auto clean up the old shuffle merge data Consider the case I mentioned above - stage retry for an `INDETERMINATE` stage. We cleanup

[GitHub] [spark] kazuyukitanimura commented on pull request #37934: [SPARK-40477][SQL] Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread GitBox
kazuyukitanimura commented on PR #37934: URL: https://github.com/apache/spark/pull/37934#issuecomment-1251489126 cc @sunchao @viirya @flyrain

[GitHub] [spark] xinrong-meng commented on pull request #37908: [SPARK-40196][PS][FOLLOWUP] `SF.lit` -> `F.lit` in `window.quantile`

2022-09-19 Thread GitBox
xinrong-meng commented on PR #37908: URL: https://github.com/apache/spark/pull/37908#issuecomment-1251489189 Thank you!

[GitHub] [spark] leewyang commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-19 Thread GitBox
leewyang commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r974626628 ## python/pyspark/ml/functions.py: ## @@ -106,6 +111,167 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] mridulm commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
mridulm commented on code in PR #37924: URL: https://github.com/apache/spark/pull/37924#discussion_r974600104 ## docs/configuration.md: ## @@ -2605,6 +2605,15 @@ Apart from these, the following properties are also available, and may be useful 2.2.0 + +

[GitHub] [spark] AmplabJenkins commented on pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
AmplabJenkins commented on PR #37924: URL: https://github.com/apache/spark/pull/37924#issuecomment-1251551174 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`

2022-09-19 Thread GitBox
AmplabJenkins commented on PR #37923: URL: https://github.com/apache/spark/pull/37923#issuecomment-1251551212 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #37922: [WIP][SPARK-40480][SHUFFLE] Remove push-based shuffle data after query finished

2022-09-19 Thread GitBox
AmplabJenkins commented on PR #37922: URL: https://github.com/apache/spark/pull/37922#issuecomment-1251551259 Can one of the admins verify this patch?

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
dongjoon-hyun commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r974424083 ## connect/src/main/scala/org/apache/spark/sql/sparkconnect/planner/SparkConnectPlanner.scala: ## @@ -0,0 +1,275 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] amaliujia commented on pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
amaliujia commented on PR #37710: URL: https://github.com/apache/spark/pull/37710#issuecomment-1251291583 @tgravescs I will follow up on the testing plan doc to address your comments. Please feel free to bring up anything in the doc or here.

[GitHub] [spark] Yaohua628 commented on pull request #37932: [SPARK-40460][SS][3.3] Fix streaming metrics when selecting _metadata

2022-09-19 Thread GitBox
Yaohua628 commented on PR #37932: URL: https://github.com/apache/spark/pull/37932#issuecomment-1251310801 cc @HeartSaVioR

[GitHub] [spark] Yaohua628 opened a new pull request, #37932: [SPARK-40460][SS][3.3] Fix streaming metrics when selecting _metadata

2022-09-19 Thread GitBox
Yaohua628 opened a new pull request, #37932: URL: https://github.com/apache/spark/pull/37932 ### What changes were proposed in this pull request? Cherry-picked from #37905 Streaming metrics report all 0 (`processedRowsPerSecond`, etc) when selecting `_metadata`

[GitHub] [spark] alex-balikov commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
alex-balikov commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974517188 ## python/pyspark/worker.py: ## @@ -361,6 +429,32 @@ def read_udfs(pickleSer, infile, eval_type): if eval_type ==

[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-19 Thread GitBox
dtenedor commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r974575955 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala: ## @@ -730,6 +729,13 @@ trait CheckAnalysis extends PredicateHelper with

[GitHub] [spark] grundprinzip commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
grundprinzip commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r974602945 ## connect/src/main/scala/org/apache/spark/sql/sparkconnect/planner/SparkConnectPlanner.scala: ## @@ -0,0 +1,275 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-09-19 Thread GitBox
xkrogen commented on PR #37634: URL: https://github.com/apache/spark/pull/37634#issuecomment-1251319065 Thanks for the suggestion @cloud-fan! Good point about there being many places where Spark trusts nullability. Here I am trying to target places where _user code_ could introduce a null. This

[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-09-19 Thread GitBox
xkrogen commented on code in PR #37634: URL: https://github.com/apache/spark/pull/37634#discussion_r974499166 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala: ## @@ -252,28 +267,44 @@ object

[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster m

2022-09-19 Thread GitBox
pralabhkumar commented on PR #37417: URL: https://github.com/apache/spark/pull/37417#issuecomment-1251334364 @dongjoon-hyun, I have incorporated all the review comments, please take a look.

[GitHub] [spark] ayudovin commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`

2022-09-19 Thread GitBox
ayudovin commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r974514017 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +993,98 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +def

[GitHub] [spark] huanliwang-db commented on a diff in pull request #37917: [SPARK-40466][SS] Improve the error message when DSv2 is disabled while DSv1 is not avaliable

2022-09-19 Thread GitBox
huanliwang-db commented on code in PR #37917: URL: https://github.com/apache/spark/pull/37917#discussion_r974439525 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala: ## @@ -1599,9 +1599,18 @@ private[sql] object QueryExecutionErrors extends

[GitHub] [spark] huanliwang-db commented on a diff in pull request #37917: [SPARK-40466][SS] Improve the error message when DSv2 is disabled while DSv1 is not avaliable

2022-09-19 Thread GitBox
huanliwang-db commented on code in PR #37917: URL: https://github.com/apache/spark/pull/37917#discussion_r974439760 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala: ## @@ -1599,9 +1599,18 @@ private[sql] object QueryExecutionErrors extends

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
dongjoon-hyun commented on code in PR #37924: URL: https://github.com/apache/spark/pull/37924#discussion_r974479715 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -230,6 +230,9 @@ private[spark] class DAGScheduler(

[GitHub] [spark] Yaohua628 commented on pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-19 Thread GitBox
Yaohua628 commented on PR #37905: URL: https://github.com/apache/spark/pull/37905#issuecomment-1251311583 > There's conflict in branch-3.3. @Yaohua628 Could you please craft a PR for branch-3.3? Thanks in advance! Done! https://github.com/apache/spark/pull/37932 - Thank you

[GitHub] [spark] AmplabJenkins commented on pull request #37930: [SPARK-40487][SQL] Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel

2022-09-19 Thread GitBox
AmplabJenkins commented on PR #37930: URL: https://github.com/apache/spark/pull/37930#issuecomment-1251324416 Can one of the admins verify this patch?

[GitHub] [spark] shrprasa commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode

2022-09-19 Thread GitBox
shrprasa commented on PR #37880: URL: https://github.com/apache/spark/pull/37880#issuecomment-1251427562 @gaborgsomogyi @dongjoon-hyun @HyukjinKwon Can you please review this PR?

[GitHub] [spark] AmplabJenkins commented on pull request #37928: [SPARK-40485][SQL] Extend the partitioning options of the JDBC data source

2022-09-19 Thread GitBox
AmplabJenkins commented on PR #37928: URL: https://github.com/apache/spark/pull/37928#issuecomment-1251440593 Can one of the admins verify this patch?

[GitHub] [spark] grundprinzip commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
grundprinzip commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r974599749 ## connect/src/main/scala/org/apache/spark/sql/sparkconnect/command/SparkConnectCommandPlanner.scala: ## @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] xinrong-meng commented on pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-19 Thread GitBox
xinrong-meng commented on PR #37888: URL: https://github.com/apache/spark/pull/37888#issuecomment-1251487641 Thank you @HyukjinKwon @zhengruifeng @Yikun for taking care of the merging!

[GitHub] [spark] leewyang commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-19 Thread GitBox
leewyang commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r974649030 ## python/pyspark/ml/functions.py: ## @@ -106,6 +111,167 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] alex-balikov commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
alex-balikov commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974680745 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala: ## @@ -0,0 +1,201 @@ +/* + * Licensed to the Apache

[GitHub] [spark] wbo4958 commented on pull request #37855: [SPARK-40407][SQL] Fix the potential data skew caused by df.repartition

2022-09-19 Thread GitBox
wbo4958 commented on PR #37855: URL: https://github.com/apache/spark/pull/37855#issuecomment-1251592287 > > @wbo4958 > > Issue: The xgboost code uses rdd barrier mode, but barrier mode does not work with `coalesce` operator. @mridulm just suggested using

[GitHub] [spark] chaoqin-li1123 opened a new pull request, #37935: Do maintenance before streaming StateStore unload

2022-09-19 Thread GitBox
chaoqin-li1123 opened a new pull request, #37935: URL: https://github.com/apache/spark/pull/37935 ### What changes were proposed in this pull request? Before unload of a StateStore, perform a cleanup. ### Why are the changes needed? Currently the maintenance of

[GitHub] [spark] MaxGekk commented on pull request #37921: [SPARK-40479][SQL] Migrate unexpected input type error to an error class

2022-09-19 Thread GitBox
MaxGekk commented on PR #37921: URL: https://github.com/apache/spark/pull/37921#issuecomment-1251528429 @srielau @anchovYu Could you take a look at the PR, please.

[GitHub] [spark] viirya closed pull request #37926: [SPARK-40484][BUILD] Upgrade log4j2 to 2.19.0

2022-09-19 Thread GitBox
viirya closed pull request #37926: [SPARK-40484][BUILD] Upgrade log4j2 to 2.19.0 URL: https://github.com/apache/spark/pull/37926 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] viirya commented on pull request #37926: [SPARK-40484][BUILD] Upgrade log4j2 to 2.19.0

2022-09-19 Thread GitBox
viirya commented on PR #37926: URL: https://github.com/apache/spark/pull/37926#issuecomment-1251278975 Thanks. Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] roczei commented on pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database

2022-09-19 Thread GitBox
roczei commented on PR #37679: URL: https://github.com/apache/spark/pull/37679#issuecomment-1251364841 Thanks @cloud-fan, I have implemented this and all tests passed. As far as I can see, we have resolved all of your feedback. -- This is an automated message from the Apache Git Service. To respond

[GitHub] [spark] alex-balikov commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
alex-balikov commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974672806 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2705,6 +2705,44 @@ object SQLConf { .booleanConf

[GitHub] [spark] dongjoon-hyun closed pull request #37424: [SPARK-39991][SQL][AQE] Use available column statistics from completed query stages

2022-09-19 Thread GitBox
dongjoon-hyun closed pull request #37424: [SPARK-39991][SQL][AQE] Use available column statistics from completed query stages URL: https://github.com/apache/spark/pull/37424 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
dongjoon-hyun commented on code in PR #37924: URL: https://github.com/apache/spark/pull/37924#discussion_r974483075 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -1860,8 +1863,17 @@ private[spark] class DAGScheduler( s"(attempt

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
dongjoon-hyun commented on code in PR #37924: URL: https://github.com/apache/spark/pull/37924#discussion_r974490653 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -1860,8 +1863,17 @@ private[spark] class DAGScheduler( s"(attempt

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
dongjoon-hyun commented on code in PR #37924: URL: https://github.com/apache/spark/pull/37924#discussion_r974491025 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -2159,6 +2171,16 @@ private[spark] class DAGScheduler( } } + private def

[GitHub] [spark] xiaonanyang-db opened a new pull request, #37933: SPARK-40474 Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread GitBox
xiaonanyang-db opened a new pull request, #37933: URL: https://github.com/apache/spark/pull/37933 ### What changes were proposed in this pull request? Adjust part of changes in https://github.com/apache/spark/pull/36871. In the pr above, we introduced the support of date

[GitHub] [spark] grundprinzip commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
grundprinzip commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r974611991 ## connect/src/main/scala/org/apache/spark/sql/sparkconnect/planner/SparkConnectPlanner.scala: ## @@ -0,0 +1,275 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] kazuyukitanimura opened a new pull request, #37934: [SPARK-40477][SQL] Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread GitBox
kazuyukitanimura opened a new pull request, #37934: URL: https://github.com/apache/spark/pull/37934 ### What changes were proposed in this pull request? This PR proposes to support `NullType` in `ColumnarBatchRow`. ### Why are the changes needed? `ColumnarBatchRow.get()`

[GitHub] [spark] xinrong-meng commented on pull request #37912: [SPARK-40196][PYTHON][PS][FOLLOWUP] Skip SparkFunctionsTests.test_repeat

2022-09-19 Thread GitBox
xinrong-meng commented on PR #37912: URL: https://github.com/apache/spark/pull/37912#issuecomment-1251488032 Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

[GitHub] [spark] leewyang commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-19 Thread GitBox
leewyang commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r974646917 ## python/pyspark/ml/functions.py: ## @@ -106,6 +111,167 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] kazuyukitanimura commented on a diff in pull request #37934: [SPARK-40477][SQL] Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread GitBox
kazuyukitanimura commented on code in PR #37934: URL: https://github.com/apache/spark/pull/37934#discussion_r974772538 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala: ## @@ -1461,10 +1462,7 @@ class ParquetIOSuite extends

[GitHub] [spark] kazuyukitanimura commented on a diff in pull request #37934: [SPARK-40477][SQL] Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread GitBox
kazuyukitanimura commented on code in PR #37934: URL: https://github.com/apache/spark/pull/37934#discussion_r974775400 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala: ## @@ -1461,10 +1462,7 @@ class ParquetIOSuite extends

[GitHub] [spark] zhengruifeng commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`

2022-09-19 Thread GitBox
zhengruifeng commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r974780246 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +994,101 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +

[GitHub] [spark] LuciferYang commented on a diff in pull request #37938: [SPARK-40490][YARN][TESTS] Ensure `YarnShuffleIntegrationSuite` tests registeredExecFile reload scenarios

2022-09-19 Thread GitBox
LuciferYang commented on code in PR #37938: URL: https://github.com/apache/spark/pull/37938#discussion_r974854247 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -237,6 +241,10 @@ protected void serviceInit(Configuration

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974854298 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2705,6 +2705,44 @@ object SQLConf { .booleanConf

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974858058 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala: ## @@ -0,0 +1,201 @@ +/* + * Licensed to the Apache

[GitHub] [spark] HyukjinKwon opened a new pull request, #37939: [MINOR][DOCS][PYTHON] Document datetime.timedelta <> DayTimeIntervalType

2022-09-19 Thread GitBox
HyukjinKwon opened a new pull request, #37939: URL: https://github.com/apache/spark/pull/37939 ### What changes were proposed in this pull request? This PR proposes to document datetime.timedelta support in PySpark in SQL DataType reference page. This support was added in SPARK-37275

[GitHub] [spark] xiaonanyang-db commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread GitBox
xiaonanyang-db commented on PR #37933: URL: https://github.com/apache/spark/pull/37933#issuecomment-1251869330 > Can you update the description to list all of the semantics of the change? You can remove the point where we need to merge them to TimestampType if this is not what the PR

[GitHub] [spark] WweiL opened a new pull request, #37936: Add additional tests to StreamingSessionWindowSuite

2022-09-19 Thread GitBox
WweiL opened a new pull request, #37936: URL: https://github.com/apache/spark/pull/37936 ## What changes were proposed in this pull request? Add complex tests to `StreamingSessionWindowSuite`. Concretely, I created two helper functions, - one is called

[GitHub] [spark] HeartSaVioR closed pull request #37917: [SPARK-40466][SS] Improve the error message when DSv2 is disabled while DSv1 is not avaliable

2022-09-19 Thread GitBox
HeartSaVioR closed pull request #37917: [SPARK-40466][SS] Improve the error message when DSv2 is disabled while DSv1 is not avaliable URL: https://github.com/apache/spark/pull/37917 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974784396 ## python/pyspark/sql/pandas/serializers.py: ## @@ -371,3 +375,292 @@ def load_stream(self, stream): raise ValueError(

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974803979 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala: ## @@ -0,0 +1,201 @@ +/* + * Licensed to the Apache

[GitHub] [spark] Yikun commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`

2022-09-19 Thread GitBox
Yikun commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r974808692 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +994,101 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +def

[GitHub] [spark] Yikun commented on pull request #36087: [SPARK-38802][K8S][TESTS] Add Support for `spark.kubernetes.test.(driver|executor)RequestCores`

2022-09-19 Thread GitBox
Yikun commented on PR #36087: URL: https://github.com/apache/spark/pull/36087#issuecomment-1251756266 @dongjoon-hyun Could we backport this to branch-3.3? It would help a lot with running branch-3.3 K8S in GitHub Actions. -- This is an automated message from the Apache Git Service. To respond

[GitHub] [spark] LuciferYang commented on a diff in pull request #37938: [SPARK-40490][YARN][TESTS] Ensure `YarnShuffleIntegrationSuite` tests registeredExecFile reload scenarios

2022-09-19 Thread GitBox
LuciferYang commented on code in PR #37938: URL: https://github.com/apache/spark/pull/37938#discussion_r974846876 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -237,6 +241,10 @@ protected void serviceInit(Configuration
