[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread GitBox
sadikovi commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r974908544 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -224,7 +223,7 @@ class UnivocityParser( case NonFatal(e) =>

[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread GitBox
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r974909141 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -224,7 +223,7 @@ class UnivocityParser( case NonFatal(e)

[GitHub] [spark] xiaonanyang-db commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread GitBox
xiaonanyang-db commented on PR #37933: URL: https://github.com/apache/spark/pull/37933#issuecomment-1251869330 > Can you update the description to list all of the semantics of the change? You can remove the point where we need to merge them to TimestampType if this is not what the PR

[GitHub] [spark] LuciferYang commented on pull request #37940: [SPARK-40494][CORE][SQL][MLLIB] Optimize the performance of `keys.zipWithIndex.toMap` code pattern

2022-09-19 Thread GitBox
LuciferYang commented on PR #37940: URL: https://github.com/apache/spark/pull/37940#issuecomment-1251869079 Test the following code with input size `1,5,10,20,50,100,150,200,300,400,500,1000,5000,1,2` ``` def testZipWithIndexToMap(valuesPerIteration: Int, collectionSize:

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974906420 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala: ## @@ -311,6 +323,56 @@ object UnsupportedOperationChecker

[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread GitBox
sadikovi commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r974905472 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -224,7 +223,7 @@ class UnivocityParser( case NonFatal(e) =>

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974905452 ## python/pyspark/worker.py: ## @@ -207,6 +209,65 @@ def wrapped(key_series, value_series): return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]

[GitHub] [spark] dongjoon-hyun commented on pull request #37729: Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-19 Thread GitBox
dongjoon-hyun commented on PR #37729: URL: https://github.com/apache/spark/pull/37729#issuecomment-1251865262 Thank you, @wangyum . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] LuciferYang opened a new pull request, #37940: [SPARK-40494][CORE][SQL][MLLIB] Optimize the performance of `keys.zipWithIndex.toMap` code pattern

2022-09-19 Thread GitBox
LuciferYang opened a new pull request, #37940: URL: https://github.com/apache/spark/pull/37940 ### What changes were proposed in this pull request? Similar as https://github.com/apache/spark/pull/37876, this pr introduce a new toMap method to `o.a.spark.util.collection.Utils`, use

[GitHub] [spark] wangyum commented on pull request #37729: Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-19 Thread GitBox
wangyum commented on PR #37729: URL: https://github.com/apache/spark/pull/37729#issuecomment-1251857340 OK. https://issues.apache.org/jira/browse/SPARK-40493

[GitHub] [spark] beliefer commented on pull request #37937: [SPARK-40491][SQL] Expose a jdbcRDD function in SparkContext

2022-09-19 Thread GitBox
beliefer commented on PR #37937: URL: https://github.com/apache/spark/pull/37937#issuecomment-1251853764 > Hm, why do we need this? Can't we do `spark.read.jdbc(...).rdd` or `toDS`? I know. This PR just follows the legacy document of `JdbcRDD`. If we don't need the change, we may

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
dongjoon-hyun commented on code in PR #37924: URL: https://github.com/apache/spark/pull/37924#discussion_r974893716 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -2221,6 +2221,14 @@ package object config { .checkValue(_ >= 0, "needs to be a

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37934: [SPARK-40477][SQL] Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread GitBox
dongjoon-hyun commented on code in PR #37934: URL: https://github.com/apache/spark/pull/37934#discussion_r974891717 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala: ## @@ -1461,10 +1462,7 @@ class ParquetIOSuite extends

[GitHub] [spark] cloud-fan commented on a diff in pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database

2022-09-19 Thread GitBox
cloud-fan commented on code in PR #37679: URL: https://github.com/apache/spark/pull/37679#discussion_r974888079 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala: ## @@ -48,9 +48,6 @@ import org.apache.spark.sql.types.StructType import

[GitHub] [spark] dongjoon-hyun commented on pull request #36096: [SPARK-38803][K8S][TESTS] Lower minio cpu to 250m (0.25) from 1 in K8s IT

2022-09-19 Thread GitBox
dongjoon-hyun commented on PR #36096: URL: https://github.com/apache/spark/pull/36096#issuecomment-1251844564 This test commit is backported to branch-3.3 according to the community request, https://github.com/apache/spark/pull/36087#issuecomment-1251757187 .

[GitHub] [spark] dongjoon-hyun commented on pull request #36087: [SPARK-38802][K8S][TESTS] Add Support for `spark.kubernetes.test.(driver|executor)RequestCores`

2022-09-19 Thread GitBox
dongjoon-hyun commented on PR #36087: URL: https://github.com/apache/spark/pull/36087#issuecomment-1251843627 Sure, @Yikun . This test commit is backported to branch-3.3 according to the community request.

[GitHub] [spark] sunpe commented on pull request #33154: [SPARK-35949][CORE]Add `keep-spark-context-alive` arg for to prevent closing spark context after invoking main for some case

2022-09-19 Thread GitBox
sunpe commented on PR #33154: URL: https://github.com/apache/spark/pull/33154#issuecomment-1251836832 > Hello @sunpe, thank you for your very fast answer. > > Please let me give you some more context, I am using Spark v3.3.0 in K8s using [Spark on K8S

[GitHub] [spark] HyukjinKwon opened a new pull request, #37939: [MINOR][DOCS][PYTHON] Document datetime.timedelta <> DayTimeIntervalType

2022-09-19 Thread GitBox
HyukjinKwon opened a new pull request, #37939: URL: https://github.com/apache/spark/pull/37939 ### What changes were proposed in this pull request? This PR proposes to document datetime.timedelta support in PySpark in SQL DataType reference page. This support was added in SPARK-37275

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HyukjinKwon commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974870249 ## python/pyspark/sql/pandas/group_ops.py: ## @@ -216,6 +218,105 @@ def applyInPandas( jdf = self._jgd.flatMapGroupsInPandas(udf_column._jc.expr())

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r97486 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala: ## @@ -0,0 +1,201 @@ +/* + * Licensed to the Apache

[GitHub] [spark] HyukjinKwon closed pull request #37932: [SPARK-40460][SS][3.3] Fix streaming metrics when selecting _metadata

2022-09-19 Thread GitBox
HyukjinKwon closed pull request #37932: [SPARK-40460][SS][3.3] Fix streaming metrics when selecting _metadata URL: https://github.com/apache/spark/pull/37932

[GitHub] [spark] HyukjinKwon commented on pull request #37932: [SPARK-40460][SS][3.3] Fix streaming metrics when selecting _metadata

2022-09-19 Thread GitBox
HyukjinKwon commented on PR #37932: URL: https://github.com/apache/spark/pull/37932#issuecomment-1251799957 Merged to branch-3.3.

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974858058 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala: ## @@ -0,0 +1,201 @@ +/* + * Licensed to the Apache

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974856402 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStateWriter.scala: ## @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974854298 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2705,6 +2705,44 @@ object SQLConf { .booleanConf

[GitHub] [spark] LuciferYang commented on a diff in pull request #37938: [SPARK-40490][YARN][TESTS] Ensure `YarnShuffleIntegrationSuite` tests registeredExecFile reload scenarios

2022-09-19 Thread GitBox
LuciferYang commented on code in PR #37938: URL: https://github.com/apache/spark/pull/37938#discussion_r974854247 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -237,6 +241,10 @@ protected void serviceInit(Configuration

[GitHub] [spark] Yikun commented on pull request #36087: [SPARK-38802][K8S][TESTS] Add Support for `spark.kubernetes.test.(driver|executor)RequestCores`

2022-09-19 Thread GitBox
Yikun commented on PR #36087: URL: https://github.com/apache/spark/pull/36087#issuecomment-1251792098 Here is a simple demo to show why we need them: https://github.com/Yikun/spark-docker/pull/5 - docker image build with tag v3.3.0 - test with 3.3.0 K8S IT in github action -

[GitHub] [spark] LuciferYang commented on a diff in pull request #37938: [SPARK-40490][YARN][TESTS] Ensure `YarnShuffleIntegrationSuite` tests registeredExecFile reload scenarios

2022-09-19 Thread GitBox
LuciferYang commented on code in PR #37938: URL: https://github.com/apache/spark/pull/37938#discussion_r974846876 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -237,6 +241,10 @@ protected void serviceInit(Configuration

[GitHub] [spark] LuciferYang commented on pull request #37938: [SPARK-40490][YARN][TESTS] Ensure `YarnShuffleIntegrationSuite` tests registeredExecFile reload scenarios

2022-09-19 Thread GitBox
LuciferYang commented on PR #37938: URL: https://github.com/apache/spark/pull/37938#issuecomment-1251782282 cc @tgravescs

[GitHub] [spark] LuciferYang opened a new pull request, #37938: [SPARK-40490][YARN][TESTS] Ensure `YarnShuffleIntegrationSuite` tests registeredExecFile reload scenarios

2022-09-19 Thread GitBox
LuciferYang opened a new pull request, #37938: URL: https://github.com/apache/spark/pull/37938 ### What changes were proposed in this pull request? After SPARK-17321, `YarnShuffleService` will persist data to local shuffle state db/reload data from local shuffle state db only when Yarn

[GitHub] [spark] beliefer opened a new pull request, #37937: [SPARK-40491][SQL] Expose a jdbcRDD function in SparkContext

2022-09-19 Thread GitBox
beliefer opened a new pull request, #37937: URL: https://github.com/apache/spark/pull/37937 ### What changes were proposed in this pull request? According to the legacy document of `JdbcRDD`, we need to expose a jdbcRDD function in `SparkContext`. ### Why are the changes

[GitHub] [spark] alex-balikov commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
alex-balikov commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974707164 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala: ## @@ -0,0 +1,201 @@ +/* + * Licensed to the Apache

[GitHub] [spark] zhengruifeng commented on a diff in pull request #37918: [SPARK-40476][ML][SQL] Reduce the shuffle size of ALS

2022-09-19 Thread GitBox
zhengruifeng commented on code in PR #37918: URL: https://github.com/apache/spark/pull/37918#discussion_r974838677 ## mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala: ## @@ -496,18 +499,23 @@ class ALSModel private[ml] ( .iterator.map { j =>

[GitHub] [spark] gengliangwang commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-19 Thread GitBox
gengliangwang commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r974833681 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -3911,6 +3911,15 @@ object SQLConf {

[GitHub] [spark] LuciferYang commented on pull request #37926: [SPARK-40484][BUILD] Upgrade log4j2 to 2.19.0

2022-09-19 Thread GitBox
LuciferYang commented on PR #37926: URL: https://github.com/apache/spark/pull/37926#issuecomment-1251761360 thanks @viirya @srowen @dongjoon-hyun

[GitHub] [spark] Yikun commented on pull request #36087: [SPARK-38802][K8S][TESTS] Add Support for `spark.kubernetes.test.(driver|executor)RequestCores`

2022-09-19 Thread GitBox
Yikun commented on PR #36087: URL: https://github.com/apache/spark/pull/36087#issuecomment-1251757187 And also this https://github.com/apache/spark/pull/36096

[GitHub] [spark] Yikun commented on pull request #36087: [SPARK-38802][K8S][TESTS] Add Support for `spark.kubernetes.test.(driver|executor)RequestCores`

2022-09-19 Thread GitBox
Yikun commented on PR #36087: URL: https://github.com/apache/spark/pull/36087#issuecomment-1251756266 @dongjoon-hyun Could we backport this to branch-3.3, this will very help to run branch-3.3 K8S in github action.

[GitHub] [spark] zhengruifeng commented on pull request #37929: [SPARK-40486][PS] Implement `spearman` and `kendall` in `DataFrame.corrwith`

2022-09-19 Thread GitBox
zhengruifeng commented on PR #37929: URL: https://github.com/apache/spark/pull/37929#issuecomment-1251750796 cc @itholic @HyukjinKwon

[GitHub] [spark] Yikun commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`

2022-09-19 Thread GitBox
Yikun commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r974808692 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +994,101 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +def

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974805894 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala: ## @@ -0,0 +1,201 @@ +/* + * Licensed to the Apache

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974803979 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala: ## @@ -0,0 +1,201 @@ +/* + * Licensed to the Apache

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974798726 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala: ## @@ -0,0 +1,201 @@ +/* + * Licensed to the Apache

[GitHub] [spark] warrenzhu25 commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
warrenzhu25 commented on code in PR #37924: URL: https://github.com/apache/spark/pull/37924#discussion_r974793372 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -1860,8 +1867,18 @@ private[spark] class DAGScheduler( s"(attempt

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974784396 ## python/pyspark/sql/pandas/serializers.py: ## @@ -371,3 +375,292 @@ def load_stream(self, stream): raise ValueError(

[GitHub] [spark] viirya commented on a diff in pull request #37934: [SPARK-40477][SQL] Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread GitBox
viirya commented on code in PR #37934: URL: https://github.com/apache/spark/pull/37934#discussion_r974783229 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala: ## @@ -1461,10 +1462,7 @@ class ParquetIOSuite extends QueryTest with

[GitHub] [spark] zhengruifeng commented on pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`

2022-09-19 Thread GitBox
zhengruifeng commented on PR #37923: URL: https://github.com/apache/spark/pull/37923#issuecomment-1251698604 ``` Oh no! 2 files would be reformatted, 352 files would be left unchanged. Please run 'dev/reformat-python' script. 1 Error: Process completed with exit code 1.

[GitHub] [spark] zhengruifeng commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`

2022-09-19 Thread GitBox
zhengruifeng commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r974780246 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +994,101 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +

[GitHub] [spark] zhengruifeng commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`

2022-09-19 Thread GitBox
zhengruifeng commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r974779483 ## python/pyspark/pandas/groupby.py: ## @@ -44,6 +43,7 @@ ) import warnings +import numpy as np Review Comment: `numpy` in the docstring was imported in

[GitHub] [spark] chaoqin-li1123 commented on pull request #37935: Do maintenance before streaming StateStore unload

2022-09-19 Thread GitBox
chaoqin-li1123 commented on PR #37935: URL: https://github.com/apache/spark/pull/37935#issuecomment-1251692899 @HeartSaVioR

[GitHub] [spark] kazuyukitanimura commented on a diff in pull request #37934: [SPARK-40477][SQL] Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread GitBox
kazuyukitanimura commented on code in PR #37934: URL: https://github.com/apache/spark/pull/37934#discussion_r974775400 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala: ## @@ -1461,10 +1462,7 @@ class ParquetIOSuite extends

[GitHub] [spark] kazuyukitanimura commented on a diff in pull request #37934: [SPARK-40477][SQL] Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread GitBox
kazuyukitanimura commented on code in PR #37934: URL: https://github.com/apache/spark/pull/37934#discussion_r974772538 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala: ## @@ -1461,10 +1462,7 @@ class ParquetIOSuite extends

[GitHub] [spark] HeartSaVioR closed pull request #37917: [SPARK-40466][SS] Improve the error message when DSv2 is disabled while DSv1 is not avaliable

2022-09-19 Thread GitBox
HeartSaVioR closed pull request #37917: [SPARK-40466][SS] Improve the error message when DSv2 is disabled while DSv1 is not avaliable URL: https://github.com/apache/spark/pull/37917

[GitHub] [spark] HeartSaVioR commented on pull request #37917: [SPARK-40466][SS] Improve the error message when DSv2 is disabled while DSv1 is not avaliable

2022-09-19 Thread GitBox
HeartSaVioR commented on PR #37917: URL: https://github.com/apache/spark/pull/37917#issuecomment-1251667880 Thanks! Merging to master.

[GitHub] [spark] WweiL opened a new pull request, #37936: Add additional tests to StreamingSessionWindowSuite

2022-09-19 Thread GitBox
WweiL opened a new pull request, #37936: URL: https://github.com/apache/spark/pull/37936 ## What changes were proposed in this pull request? Add complex tests to `StreamingSessionWindowSuite`. Concretely, I created two helper functions, - one is called

[GitHub] [spark] mengxr commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-19 Thread GitBox
mengxr commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r974747883 ## python/pyspark/ml/functions.py: ## @@ -106,6 +111,167 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] dongjoon-hyun commented on pull request #37729: Revert "[SPARK-33861][SQL] Simplify conditional in predicate"

2022-09-19 Thread GitBox
dongjoon-hyun commented on PR #37729: URL: https://github.com/apache/spark/pull/37729#issuecomment-1251641154 Sorry, but I missed that this is an ancient patch. To @wangyum , we need a new JIRA when we revert already released patches.

[GitHub] [spark] viirya commented on a diff in pull request #37934: [SPARK-40477][SQL] Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread GitBox
viirya commented on code in PR #37934: URL: https://github.com/apache/spark/pull/37934#discussion_r974736080 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala: ## @@ -1461,10 +1462,7 @@ class ParquetIOSuite extends QueryTest with

[GitHub] [spark] viirya commented on a diff in pull request #37934: [SPARK-40477][SQL] Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread GitBox
viirya commented on code in PR #37934: URL: https://github.com/apache/spark/pull/37934#discussion_r974734194 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala: ## @@ -1461,10 +1462,7 @@ class ParquetIOSuite extends QueryTest with

[GitHub] [spark] chaoqin-li1123 opened a new pull request, #37935: Do maintenance before streaming StateStore unload

2022-09-19 Thread GitBox
chaoqin-li1123 opened a new pull request, #37935: URL: https://github.com/apache/spark/pull/37935 ### What changes were proposed in this pull request? Before unload of a StateStore, perform a cleanup. ### Why are the changes needed? Current the maintenance of

[GitHub] [spark] wbo4958 commented on pull request #37855: [SPARK-40407][SQL] Fix the potential data skew caused by df.repartition

2022-09-19 Thread GitBox
wbo4958 commented on PR #37855: URL: https://github.com/apache/spark/pull/37855#issuecomment-1251592287 > > @wbo4958 > > Issue: The xgboost code uses rdd barrier mode, but barrier mode does not work with `coalesce` operator. @mridulm just suggested using

[GitHub] [spark] dongjoon-hyun closed pull request #37424: [SPARK-39991][SQL][AQE] Use available column statistics from completed query stages

2022-09-19 Thread GitBox
dongjoon-hyun closed pull request #37424: [SPARK-39991][SQL][AQE] Use available column statistics from completed query stages URL: https://github.com/apache/spark/pull/37424

[GitHub] [spark] AmplabJenkins commented on pull request #37922: [WIP][SPARK-40480][SHUFFLE] Remove push-based shuffle data after query finished

2022-09-19 Thread GitBox
AmplabJenkins commented on PR #37922: URL: https://github.com/apache/spark/pull/37922#issuecomment-1251551259 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`

2022-09-19 Thread GitBox
AmplabJenkins commented on PR #37923: URL: https://github.com/apache/spark/pull/37923#issuecomment-1251551212 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
AmplabJenkins commented on PR #37924: URL: https://github.com/apache/spark/pull/37924#issuecomment-1251551174 Can one of the admins verify this patch?

[GitHub] [spark] alex-balikov commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
alex-balikov commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974680745 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala: ## @@ -0,0 +1,201 @@ +/* + * Licensed to the Apache

[GitHub] [spark] alex-balikov commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
alex-balikov commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974672806 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2705,6 +2705,44 @@ object SQLConf { .booleanConf

[GitHub] [spark] MaxGekk commented on pull request #37921: [SPARK-40479][SQL] Migrate unexpected input type error to an error class

2022-09-19 Thread GitBox
MaxGekk commented on PR #37921: URL: https://github.com/apache/spark/pull/37921#issuecomment-1251528429 @srielau @anchovYu Could you take a look at the PR, please.

[GitHub] [spark] leewyang commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-19 Thread GitBox
leewyang commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r974649030 ## python/pyspark/ml/functions.py: ## @@ -106,6 +111,167 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] leewyang commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-19 Thread GitBox
leewyang commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r974646917 ## python/pyspark/ml/functions.py: ## @@ -106,6 +111,167 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] xinrong-meng commented on pull request #37908: [SPARK-40196][PS][FOLLOWUP] `SF.lit` -> `F.lit` in `window.quantile`

2022-09-19 Thread GitBox
xinrong-meng commented on PR #37908: URL: https://github.com/apache/spark/pull/37908#issuecomment-1251489189 Thank you!

[GitHub] [spark] leewyang commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-19 Thread GitBox
leewyang commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r974626628 ## python/pyspark/ml/functions.py: ## @@ -106,6 +111,167 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] kazuyukitanimura commented on pull request #37934: [SPARK-40477][SQL] Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread GitBox
kazuyukitanimura commented on PR #37934: URL: https://github.com/apache/spark/pull/37934#issuecomment-1251489126 cc @sunchao @viirya @flyrain

[GitHub] [spark] xinrong-meng commented on pull request #37912: [SPARK-40196][PYTHON][PS][FOLLOWUP] Skip SparkFunctionsTests.test_repeat

2022-09-19 Thread GitBox
xinrong-meng commented on PR #37912: URL: https://github.com/apache/spark/pull/37912#issuecomment-1251488032 Thank you!

[GitHub] [spark] xinrong-meng commented on pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-19 Thread GitBox
xinrong-meng commented on PR #37888: URL: https://github.com/apache/spark/pull/37888#issuecomment-1251487641 Thank you @HyukjinKwon @zhengruifeng @Yikun for taking care of the merging!

[GitHub] [spark] kazuyukitanimura opened a new pull request, #37934: [SPARK-40477][SQL] Support `NullType` in `ColumnarBatchRow`

2022-09-19 Thread GitBox
kazuyukitanimura opened a new pull request, #37934: URL: https://github.com/apache/spark/pull/37934 ### What changes were proposed in this pull request? This PR proposes to support `NullType` in `ColumnarBatchRow`. ### Why are the changes needed? `ColumnarBatchRow.get()`

[GitHub] [spark] grundprinzip commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
grundprinzip commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r974611991 ## connect/src/main/scala/org/apache/spark/sql/sparkconnect/planner/SparkConnectPlanner.scala: ## @@ -0,0 +1,275 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] mridulm commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
mridulm commented on code in PR #37924: URL: https://github.com/apache/spark/pull/37924#discussion_r974600104 ## docs/configuration.md: ## @@ -2605,6 +2605,15 @@ Apart from these, the following properties are also available, and may be useful 2.2.0 + +

[GitHub] [spark] grundprinzip commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
grundprinzip commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r974602945 ## connect/src/main/scala/org/apache/spark/sql/sparkconnect/planner/SparkConnectPlanner.scala: ## @@ -0,0 +1,275 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] grundprinzip commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-19 Thread GitBox
grundprinzip commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r974599749 ## connect/src/main/scala/org/apache/spark/sql/sparkconnect/command/SparkConnectCommandPlanner.scala: ## @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] AmplabJenkins commented on pull request #37928: [SPARK-40485][SQL] Extend the partitioning options of the JDBC data source

2022-09-19 Thread GitBox
AmplabJenkins commented on PR #37928: URL: https://github.com/apache/spark/pull/37928#issuecomment-1251440593 Can one of the admins verify this patch?

[GitHub] [spark] shrprasa commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode

2022-09-19 Thread GitBox
shrprasa commented on PR #37880: URL: https://github.com/apache/spark/pull/37880#issuecomment-1251427562 @gaborgsomogyi @dongjoon-hyun @HyukjinKwon Can you please review this PR?

[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-19 Thread GitBox
dtenedor commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r974575955 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala: ## @@ -730,6 +729,13 @@ trait CheckAnalysis extends PredicateHelper with

[GitHub] [spark] alex-balikov commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-19 Thread GitBox
alex-balikov commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974517188 ## python/pyspark/worker.py: ## @@ -361,6 +429,32 @@ def read_udfs(pickleSer, infile, eval_type): if eval_type ==

[GitHub] [spark] roczei commented on pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database

2022-09-19 Thread GitBox
roczei commented on PR #37679: URL: https://github.com/apache/spark/pull/37679#issuecomment-1251364841 Thanks @cloud-fan, I have implemented this and all tests passed. As far as I can see, we have resolved all of your feedback.

[GitHub] [spark] mridulm commented on pull request #37922: [WIP][SPARK-40480][SHUFFLE] Remove push-based shuffle data after query finished

2022-09-19 Thread GitBox
mridulm commented on PR #37922: URL: https://github.com/apache/spark/pull/37922#issuecomment-1251349810 > The push-based shuffle service will auto clean up the old shuffle merge data Consider the case I mentioned above - stage retry for an `INDETERMINATE` stage. We cleanup

[GitHub] [spark] ayudovin commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`

2022-09-19 Thread GitBox
ayudovin commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r974514017 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +993,98 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +def

[GitHub] [spark] pralabhkumar commented on pull request #37417: [SPARK-33782][K8S][CORE]Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster m

2022-09-19 Thread GitBox
pralabhkumar commented on PR #37417: URL: https://github.com/apache/spark/pull/37417#issuecomment-1251334364 @dongjoon-hyun, I have incorporated all the review comments; please take a look.

[GitHub] [spark] AmplabJenkins commented on pull request #37930: [SPARK-40487][SQL] Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel

2022-09-19 Thread GitBox
AmplabJenkins commented on PR #37930: URL: https://github.com/apache/spark/pull/37930#issuecomment-1251324416 Can one of the admins verify this patch?

[GitHub] [spark] xiaonanyang-db opened a new pull request, #37933: SPARK-40474 Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread GitBox
xiaonanyang-db opened a new pull request, #37933: URL: https://github.com/apache/spark/pull/37933 ### What changes were proposed in this pull request? Adjust part of changes in https://github.com/apache/spark/pull/36871. In the pr above, we introduced the support of date

[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-09-19 Thread GitBox
xkrogen commented on code in PR #37634: URL: https://github.com/apache/spark/pull/37634#discussion_r974499166 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala: ## @@ -252,28 +267,44 @@ object

[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-09-19 Thread GitBox
xkrogen commented on PR #37634: URL: https://github.com/apache/spark/pull/37634#issuecomment-1251319065 Thanks for the suggestion @cloud-fan! Good point about there being many places where Spark trusts nullability. Here I am trying to target places where _user code_ could introduce a null. This

[GitHub] [spark] Yaohua628 commented on pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-19 Thread GitBox
Yaohua628 commented on PR #37905: URL: https://github.com/apache/spark/pull/37905#issuecomment-1251311583 > There's conflict in branch-3.3. @Yaohua628 Could you please craft a PR for branch-3.3? Thanks in advance! Done! https://github.com/apache/spark/pull/37932 - Thank you

[GitHub] [spark] Yaohua628 commented on pull request #37932: [SPARK-40460][SS][3.3] Fix streaming metrics when selecting _metadata

2022-09-19 Thread GitBox
Yaohua628 commented on PR #37932: URL: https://github.com/apache/spark/pull/37932#issuecomment-1251310801 cc @HeartSaVioR

[GitHub] [spark] Yaohua628 opened a new pull request, #37932: [SPARK-40460][SS][3.3] Fix streaming metrics when selecting _metadata

2022-09-19 Thread GitBox
Yaohua628 opened a new pull request, #37932: URL: https://github.com/apache/spark/pull/37932 ### What changes were proposed in this pull request? Cherry-picked from #37905 Streaming metrics report all 0 (`processedRowsPerSecond`, etc) when selecting `_metadata`

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
dongjoon-hyun commented on code in PR #37924: URL: https://github.com/apache/spark/pull/37924#discussion_r974491025 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -2159,6 +2171,16 @@ private[spark] class DAGScheduler( } } + private def

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
dongjoon-hyun commented on code in PR #37924: URL: https://github.com/apache/spark/pull/37924#discussion_r974490653 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -1860,8 +1863,17 @@ private[spark] class DAGScheduler( s"(attempt

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

2022-09-19 Thread GitBox
dongjoon-hyun commented on code in PR #37924: URL: https://github.com/apache/spark/pull/37924#discussion_r974486551 ## docs/configuration.md: ## @@ -2605,6 +2605,15 @@ Apart from these, the following properties are also available, and may be useful 2.2.0 + +
