[GitHub] [spark] chaoqin-li1123 opened a new pull request, #38013: [SPARK-40509][SS][PYTHON] add example for applyInPandasWithState

2022-09-27 Thread GitBox
chaoqin-li1123 opened a new pull request, #38013: URL: https://github.com/apache/spark/pull/38013 ### What changes were proposed in this pull request? An example of applyInPandasWithState usage. This example splits lines into words, groups by word as the key, and uses per-key state
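A minimal sketch (not the PR's example verbatim) of the applyInPandasWithState pattern described above; the socket source, host/port, and column names are placeholders:

```python
# Hedged sketch: split lines into words, group by word, keep a per-key running
# count in state. Assumes a local socket source; names below are illustrative.
from typing import Iterator, Tuple
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

spark = SparkSession.builder.appName("word_count_with_state").getOrCreate()
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())
words = lines.select(explode(split(lines.value, " ")).alias("word"))

def count_words(key: Tuple[str], pdfs: Iterator[pd.DataFrame],
                state: GroupState) -> Iterator[pd.DataFrame]:
    count = state.get[0] if state.exists else 0   # previously stored count
    count += sum(len(pdf) for pdf in pdfs)        # rows seen in this batch
    state.update((count,))
    yield pd.DataFrame({"word": [key[0]], "count": [count]})

counts = words.groupBy("word").applyInPandasWithState(
    count_words,
    outputStructType="word STRING, count LONG",
    stateStructType="count LONG",
    outputMode="Update",
    timeoutConf=GroupStateTimeout.NoTimeout,
)
query = counts.writeStream.outputMode("update").format("console").start()
```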

[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error

2022-09-27 Thread GitBox
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1259029762 Can you take a look at the comment above, @yabola, and work on the fix? Since you already spent a lot of time on this. -- This is an automated message from the Apache Git Service. To

[GitHub] [spark] itholic closed pull request #38012: [DO-NOT-MERGE][TEST] Pandas 1.5 Test

2022-09-27 Thread GitBox
itholic closed pull request #38012: [DO-NOT-MERGE][TEST] Pandas 1.5 Test URL: https://github.com/apache/spark/pull/38012 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

[GitHub] [spark] zhengruifeng closed pull request #38009: [SPARK-40573][PS] Make `ddof` in `GroupBy.std`, `GroupBy.var` and `GroupBy.sem` accept arbitrary integers

2022-09-27 Thread GitBox
zhengruifeng closed pull request #38009: [SPARK-40573][PS] Make `ddof` in `GroupBy.std`, `GroupBy.var` and `GroupBy.sem` accept arbitrary integers URL: https://github.com/apache/spark/pull/38009 -- This is an automated message from the Apache Git Service. To respond to the message, please log

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

2022-09-27 Thread GitBox
dongjoon-hyun commented on code in PR #38001: URL: https://github.com/apache/spark/pull/38001#discussion_r980867778 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -3574,6 +3574,15 @@ object SQLConf { .booleanConf

[GitHub] [spark] itholic opened a new pull request, #38016: [SPARK-40577][PS] Fix `IndexesTest.test_to_frame` when pandas 1.5.0

2022-09-27 Thread GitBox
itholic opened a new pull request, #38016: URL: https://github.com/apache/spark/pull/38016 ### What changes were proposed in this pull request? This PR proposes to fix the test `IndexesTest.test_to_frame` to support pandas 1.5.0 ### Why are the changes needed?

[GitHub] [spark] zhengruifeng opened a new pull request, #38017: [SPARK-40579][PS] `GroupBy.first` should skip NULLs

2022-09-27 Thread GitBox
zhengruifeng opened a new pull request, #38017: URL: https://github.com/apache/spark/pull/38017 ### What changes were proposed in this pull request? make `GroupBy.first` skip nulls ### Why are the changes needed? to fix the behavior difference ``` In [1]:
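A hedged illustration of the behavior difference the PR targets: in pandas, `groupby(...).first()` returns the first non-null value per group, so pandas-on-Spark should do the same (the data below is made up for illustration):

```python
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({"k": ["a", "a", "b"], "v": [None, 1.0, 2.0]})
print(pdf.groupby("k").first())                   # group "a" -> 1.0 (NaN skipped)
print(ps.from_pandas(pdf).groupby("k").first())   # should match pandas after the fix
```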

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38013: [SPARK-40509][SS][PYTHON] add example for applyInPandasWithState

2022-09-27 Thread GitBox
HeartSaVioR commented on code in PR #38013: URL: https://github.com/apache/spark/pull/38013#discussion_r980858543 ## examples/src/main/python/sql/streaming/structured_network_wordcount_session_window.py: ## @@ -0,0 +1,114 @@ +# +# Licensed to the Apache Software Foundation

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38013: [SPARK-40509][SS][PYTHON] add example for applyInPandasWithState

2022-09-27 Thread GitBox
HeartSaVioR commented on code in PR #38013: URL: https://github.com/apache/spark/pull/38013#discussion_r980870512 ## examples/src/main/python/sql/streaming/structured_network_wordcount_session_window.py: ## @@ -0,0 +1,114 @@ +# +# Licensed to the Apache Software Foundation

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r980939958 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +869,50 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] itholic commented on pull request #38018: [SPARK-40580][PS][DOCS] Update the document for `DataFrame.to_orc`.

2022-09-27 Thread GitBox
itholic commented on PR #38018: URL: https://github.com/apache/spark/pull/38018#issuecomment-1259194440 Yeah, I think maybe we should also address the other I/O functions if there are behavior differences. We already document the differences for almost all I/O functions, but it seems

[GitHub] [spark] HeartSaVioR commented on pull request #37936: [SPARK-40495] [SQL] [TESTS] Add additional tests to StreamingSessionWindowSuite

2022-09-27 Thread GitBox
HeartSaVioR commented on PR #37936: URL: https://github.com/apache/spark/pull/37936#issuecomment-1259257462 @WweiL The GA build unfortunately caught the unused import. Could you please run `mvn clean install -DskipTests` and `dev/scalastyle` and make sure both pass, before pushing a new

[GitHub] [spark] AmplabJenkins commented on pull request #38013: [SPARK-40509][SS][PYTHON] add example for applyInPandasWithState

2022-09-27 Thread GitBox
AmplabJenkins commented on PR #38013: URL: https://github.com/apache/spark/pull/38013#issuecomment-1259256936 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[GitHub] [spark] zhengruifeng opened a new pull request, #38014: [SPARK-40575][DOCS] Add badges for PySpark downloads

2022-09-27 Thread GitBox
zhengruifeng opened a new pull request, #38014: URL: https://github.com/apache/spark/pull/38014 ### What changes were proposed in this pull request? Add badges for PySpark downloads ### Why are the changes needed? projects like

[GitHub] [spark] LuciferYang commented on pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][DSTREAM][R] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-27 Thread GitBox
LuciferYang commented on PR #37876: URL: https://github.com/apache/spark/pull/37876#issuecomment-1259053609 @caican00 I'm not sure whether it would be better to change to use `toJavaMap` or `toJavaMap.asScala` here. Can you help test it? -- This is an automated message from

[GitHub] [spark] zhengruifeng commented on pull request #38009: [SPARK-40573][PS] Make `ddof` in `GroupBy.std`, `GroupBy.var` and `GroupBy.sem` accept arbitrary integers

2022-09-27 Thread GitBox
zhengruifeng commented on PR #38009: URL: https://github.com/apache/spark/pull/38009#issuecomment-1259053301 Merged into master, thanks @HyukjinKwon for the reviews -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

[GitHub] [spark] roczei commented on pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database

2022-09-27 Thread GitBox
roczei commented on PR #37679: URL: https://github.com/apache/spark/pull/37679#issuecomment-1259053324 @cloud-fan, Thank you very much for your help! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38015: [SPARK-40577][PS] Fix `CategoricalIndex.append` to match pandas 1.5.0

2022-09-27 Thread GitBox
zhengruifeng commented on code in PR #38015: URL: https://github.com/apache/spark/pull/38015#discussion_r980909404 ## python/pyspark/pandas/indexes/base.py: ## @@ -1907,6 +1908,9 @@ def append(self, other: "Index") -> "Index": ) index_fields =

[GitHub] [spark] huleilei commented on a diff in pull request #38007: [SPARK-40566][SQL] Add showIndex function

2022-09-27 Thread GitBox
huleilei commented on code in PR #38007: URL: https://github.com/apache/spark/pull/38007#discussion_r980926707 ## sql/hive/src/test/resources/ql/src/test/queries/clientpositive/index_bitmap2.q: ## @@ -4,7 +4,10 @@ CREATE INDEX src1_index ON TABLE src(key) as 'BITMAP' WITH

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r980967098 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala: ## @@ -137,48 +138,49 @@ class DatasetUnpivotSuite extends QueryTest

[GitHub] [spark] itholic commented on a diff in pull request #38016: [SPARK-40578][PS] Fix `IndexesTest.test_to_frame` when pandas 1.5.0

2022-09-27 Thread GitBox
itholic commented on code in PR #38016: URL: https://github.com/apache/spark/pull/38016#discussion_r980971632 ## python/pyspark/pandas/tests/indexes/test_base.py: ## @@ -203,9 +203,35 @@ def test_to_frame(self): # non-string names

[GitHub] [spark] itholic commented on a diff in pull request #38016: [SPARK-40578][PS] Fix `IndexesTest.test_to_frame` when pandas 1.5.0

2022-09-27 Thread GitBox
itholic commented on code in PR #38016: URL: https://github.com/apache/spark/pull/38016#discussion_r980972119 ## python/pyspark/pandas/tests/indexes/test_base.py: ## @@ -203,9 +203,35 @@ def test_to_frame(self): # non-string names

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37995: [SPARK-40556][PS][SQL] Unpersist the intermediate datasets cached in `AttachDistributedSequenceExec`

2022-09-27 Thread GitBox
HyukjinKwon commented on code in PR #37995: URL: https://github.com/apache/spark/pull/37995#discussion_r980983915 ## python/pyspark/pandas/series.py: ## @@ -6442,6 +6445,8 @@ def argmin(self, axis: Axis = None, skipna: bool = True) -> int: raise ValueError("axis

[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r980984200 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +869,50 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] dongjoon-hyun commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

2022-09-27 Thread GitBox
dongjoon-hyun commented on PR #38001: URL: https://github.com/apache/spark/pull/38001#issuecomment-1259217641 Thank you again, @cloud-fan , @viirya , @thiyaga, @huaxingao , @zhengruifeng . Since the last commit is about docs, I'll merge this. Merged to master/3.3/3.2. cc

[GitHub] [spark] dongjoon-hyun closed pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

2022-09-27 Thread GitBox
dongjoon-hyun closed pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy` URL: https://github.com/apache/spark/pull/38001 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

[GitHub] [spark] chaoqin-li1123 commented on pull request #38013: [SPARK-40509][SS][PYTHON] add example for applyInPandasWithState

2022-09-27 Thread GitBox
chaoqin-li1123 commented on PR #38013: URL: https://github.com/apache/spark/pull/38013#issuecomment-1259024291 @HeartSaVioR The applyInPandasWithState session window example. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub

[GitHub] [spark] yabola commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error

2022-09-27 Thread GitBox
yabola commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1259032890 @mridulm Thanks a lot for your analysis! Please give me some time to understand it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on

[GitHub] [spark] itholic opened a new pull request, #38015: [SPARK-40577][PS] Fix CategoricalIndex.append to match pandas 1.5.0

2022-09-27 Thread GitBox
itholic opened a new pull request, #38015: URL: https://github.com/apache/spark/pull/38015 ### What changes were proposed in this pull request? The PR proposes to fix `CategoricalIndex.append` to match the behavior with pandas. ### Why are the changes needed?

[GitHub] [spark] LuciferYang commented on pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][DSTREAM][R] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-27 Thread GitBox
LuciferYang commented on PR #37876: URL: https://github.com/apache/spark/pull/37876#issuecomment-1259066433 Thanks ~ @caican00 waiting for your feedback :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r980870200 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +873,55 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] Yikf commented on a diff in pull request #38007: [SPARK-40566][SQL] Add showIndex function

2022-09-27 Thread GitBox
Yikf commented on code in PR #38007: URL: https://github.com/apache/spark/pull/38007#discussion_r980876832 ## sql/hive/src/test/resources/ql/src/test/queries/clientpositive/index_bitmap2.q: ## @@ -4,7 +4,10 @@ CREATE INDEX src1_index ON TABLE src(key) as 'BITMAP' WITH DEFERRED

[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r980878546 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +869,50 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38016: [SPARK-40578][PS] Fix `IndexesTest.test_to_frame` when pandas 1.5.0

2022-09-27 Thread GitBox
zhengruifeng commented on code in PR #38016: URL: https://github.com/apache/spark/pull/38016#discussion_r980911416 ## python/pyspark/pandas/tests/indexes/test_base.py: ## @@ -203,9 +203,35 @@ def test_to_frame(self): # non-string names

[GitHub] [spark] huleilei commented on a diff in pull request #38007: [SPARK-40566][SQL] Add showIndex function

2022-09-27 Thread GitBox
huleilei commented on code in PR #38007: URL: https://github.com/apache/spark/pull/38007#discussion_r980919285 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -216,6 +216,7 @@ statement LEFT_PAREN

[GitHub] [spark] HeartSaVioR closed pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random python worker fail

2022-09-27 Thread GitBox
HeartSaVioR closed pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random python worker failures URL: https://github.com/apache/spark/pull/38008 -- This is an automated message from the Apache Git

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r980944809 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +869,50 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] huleilei commented on pull request #38007: [SPARK-40566][SQL] Add showIndex function

2022-09-27 Thread GitBox
huleilei commented on PR #38007: URL: https://github.com/apache/spark/pull/38007#issuecomment-1259188003 > @huleilei mind completing the PR description? OK, I have completed the PR description. Thanks. -- This is an automated message from the Apache Git Service. To respond to the

[GitHub] [spark] caican00 commented on pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][DSTREAM][R] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-27 Thread GitBox
caican00 commented on PR #37876: URL: https://github.com/apache/spark/pull/37876#issuecomment-1259039554 > the collection size is greater than 500 `the collection size is greater than 500`, is it the number of elements in a collection? -- This is an automated message from the

[GitHub] [spark] LuciferYang commented on pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][DSTREAM][R] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-27 Thread GitBox
LuciferYang commented on PR #37876: URL: https://github.com/apache/spark/pull/37876#issuecomment-1259058629 @caican00 Or if you can provide a micro-bench that can be run with GA, I am happy to continue to solve your issue together -- This is an automated message from the Apache Git

[GitHub] [spark] EvgenyZamyatin commented on pull request #37967: Scalable SkipGram-Word2Vec implementation

2022-09-27 Thread GitBox
EvgenyZamyatin commented on PR #37967: URL: https://github.com/apache/spark/pull/37967#issuecomment-1259169597 > is it possible to improve existing w2v instead of implementing a new one? Yes. How do you think it should be done? Under a mode setting? > what about implementing it in

[GitHub] [spark] zhengruifeng commented on pull request #37770: [SPARK-40314][SQL][PYTHON] Add scala and python bindings for inline and inline_outer

2022-09-27 Thread GitBox
zhengruifeng commented on PR #37770: URL: https://github.com/apache/spark/pull/37770#issuecomment-1259182911 also, what about adding some tests in `python/pyspark/sql/tests/test_functions.py`? -- This is an automated message from the Apache Git Service. To respond to the message, please
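A rough sketch of the kind of check that could go into `python/pyspark/sql/tests/test_functions.py`, assuming the `inline` binding proposed in this PR (which explodes an array of structs into columns):

```python
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(structlist=[Row(a=1, b=2), Row(a=3, b=4)])])
# inline(array<struct<a,b>>) should yield one row per struct, with columns a and b.
assert df.select(F.inline(df.structlist)).collect() == [Row(a=1, b=2), Row(a=3, b=4)]
```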

[GitHub] [spark] itholic commented on a diff in pull request #38016: [SPARK-40578][PS] Fix `IndexesTest.test_to_frame` when pandas 1.5.0

2022-09-27 Thread GitBox
itholic commented on code in PR #38016: URL: https://github.com/apache/spark/pull/38016#discussion_r980971632 ## python/pyspark/pandas/tests/indexes/test_base.py: ## @@ -203,9 +203,35 @@ def test_to_frame(self): # non-string names

[GitHub] [spark] itholic commented on a diff in pull request #38016: [SPARK-40578][PS] Fix `IndexesTest.test_to_frame` when pandas 1.5.0

2022-09-27 Thread GitBox
itholic commented on code in PR #38016: URL: https://github.com/apache/spark/pull/38016#discussion_r980976721 ## python/pyspark/pandas/tests/indexes/test_base.py: ## @@ -203,9 +203,35 @@ def test_to_frame(self): # non-string names

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37995: [SPARK-40556][PS][SQL] Unpersist the intermediate datasets cached in `AttachDistributedSequenceExec`

2022-09-27 Thread GitBox
HyukjinKwon commented on code in PR #37995: URL: https://github.com/apache/spark/pull/37995#discussion_r980983915 ## python/pyspark/pandas/series.py: ## @@ -6442,6 +6445,8 @@ def argmin(self, axis: Axis = None, skipna: bool = True) -> int: raise ValueError("axis

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38018: [SPARK-40580][PS][DOCS] Update the document for `DataFrame.to_orc`.

2022-09-27 Thread GitBox
HyukjinKwon commented on code in PR #38018: URL: https://github.com/apache/spark/pull/38018#discussion_r980991301 ## python/pyspark/pandas/frame.py: ## @@ -5317,6 +5317,12 @@ def to_orc( ... '%s/to_orc/foo.orc' % path, ... mode = 'overwrite',

[GitHub] [spark] caican00 commented on pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][DSTREAM][R] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-27 Thread GitBox
caican00 commented on PR #37876: URL: https://github.com/apache/spark/pull/37876#issuecomment-1259063989 > > @caican00 I'm not sure whether it would be better to change to use `toJavaMap` or `toJavaMap.asScala` here. Can you help test it? > > Hmm... Could you try this one? Okay.

[GitHub] [spark] cloud-fan commented on a diff in pull request #37825: [SPARK-40382][SQL] Group distinct aggregate expressions by semantically equivalent children in `RewriteDistinctAggregates`

2022-09-27 Thread GitBox
cloud-fan commented on code in PR #37825: URL: https://github.com/apache/spark/pull/37825#discussion_r980852866 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala: ## @@ -254,7 +254,9 @@ object RewriteDistinctAggregates

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38016: [SPARK-40578][PS] Fix `IndexesTest.test_to_frame` when pandas 1.5.0

2022-09-27 Thread GitBox
HyukjinKwon commented on code in PR #38016: URL: https://github.com/apache/spark/pull/38016#discussion_r980897890 ## python/pyspark/pandas/tests/indexes/test_base.py: ## @@ -203,9 +203,35 @@ def test_to_frame(self): # non-string names

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38016: [SPARK-40578][PS] Fix `IndexesTest.test_to_frame` when pandas 1.5.0

2022-09-27 Thread GitBox
HyukjinKwon commented on code in PR #38016: URL: https://github.com/apache/spark/pull/38016#discussion_r980898223 ## python/pyspark/pandas/tests/indexes/test_base.py: ## @@ -203,9 +203,35 @@ def test_to_frame(self): # non-string names

[GitHub] [spark] itholic opened a new pull request, #38018: [SPARK-40580] Update the document for `DataFrame.to_orc`.

2022-09-27 Thread GitBox
itholic opened a new pull request, #38018: URL: https://github.com/apache/spark/pull/38018 ### What changes were proposed in this pull request? This PR proposes to update the docstring of `DataFrame.to_orc`, since `pandas.DataFrame.to_orc` is supported from pandas 1.5.0,
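A minimal usage sketch of the API whose docstring is being updated; the path and data below are placeholders:

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
psdf.to_orc("/tmp/to_orc_example", mode="overwrite")   # writes ORC files via Spark
print(ps.read_orc("/tmp/to_orc_example"))
```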

[GitHub] [spark] huleilei commented on a diff in pull request #38007: [SPARK-40566][SQL] Add showIndex function

2022-09-27 Thread GitBox
huleilei commented on code in PR #38007: URL: https://github.com/apache/spark/pull/38007#discussion_r980928571 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowIndexExec.scala: ## @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] zhengruifeng commented on a diff in pull request #37770: [SPARK-40314][SQL][PYTHON] Add scala and python bindings for inline and inline_outer

2022-09-27 Thread GitBox
zhengruifeng commented on code in PR #37770: URL: https://github.com/apache/spark/pull/37770#discussion_r980946968 ## sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala: ## @@ -219,20 +219,21 @@ class GeneratorFunctionSuite extends QueryTest with

[GitHub] [spark] itholic commented on a diff in pull request #38015: [SPARK-40577][PS] Fix `CategoricalIndex.append` to match pandas 1.5.0

2022-09-27 Thread GitBox
itholic commented on code in PR #38015: URL: https://github.com/apache/spark/pull/38015#discussion_r980954195 ## python/pyspark/pandas/indexes/base.py: ## @@ -1907,6 +1908,9 @@ def append(self, other: "Index") -> "Index": ) index_fields =

[GitHub] [spark] zhengruifeng commented on pull request #37759: [SPARK-40306][SQL]Support more than Integer.MAX_VALUE of the same join key

2022-09-27 Thread GitBox
zhengruifeng commented on PR #37759: URL: https://github.com/apache/spark/pull/37759#issuecomment-1259211420 @wankunde in your UT, the variable `duplicateKeyNumber` is negative ``` scala> val duplicateKeyNumber = Integer.MAX_VALUE + 2 val duplicateKeyNumber: Int =
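A hypothetical illustration (not from the thread) of why that value turns negative: JVM `Int` arithmetic wraps around in the signed 32-bit range.

```python
def to_int32(x: int) -> int:
    """Wrap an unbounded Python int into the signed 32-bit range, like a JVM Int."""
    x &= 0xFFFFFFFF
    return x - 0x1_0000_0000 if x >= 0x8000_0000 else x

print(to_int32(2147483647 + 2))  # -2147483647, i.e. Integer.MAX_VALUE + 2 wraps
```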

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38018: [SPARK-40580][PS][DOCS] Update the document for `DataFrame.to_orc`.

2022-09-27 Thread GitBox
HyukjinKwon commented on code in PR #38018: URL: https://github.com/apache/spark/pull/38018#discussion_r980990148 ## python/pyspark/pandas/frame.py: ## @@ -5266,12 +5266,12 @@ def to_orc( **options: "OptionalPrimitiveType", ) -> None: """ -Write

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38018: [SPARK-40580][PS][DOCS] Update the document for `DataFrame.to_orc`.

2022-09-27 Thread GitBox
HyukjinKwon commented on code in PR #38018: URL: https://github.com/apache/spark/pull/38018#discussion_r980990713 ## python/pyspark/pandas/frame.py: ## @@ -5317,6 +5317,12 @@ def to_orc( ... '%s/to_orc/foo.orc' % path, ... mode = 'overwrite',

[GitHub] [spark] HeartSaVioR commented on pull request #38013: [SPARK-40509][SS][PYTHON] add example for applyInPandasWithState

2022-09-27 Thread GitBox
HeartSaVioR commented on PR #38013: URL: https://github.com/apache/spark/pull/38013#issuecomment-1259259980 One tip: unlike Scala/Java code, we can leverage `dev/reformat-python` to reformat Python code automatically. -- This is an automated message from the Apache Git Service. To

[GitHub] [spark] LuciferYang commented on pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][DSTREAM][R] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-27 Thread GitBox
LuciferYang commented on PR #37876: URL: https://github.com/apache/spark/pull/37876#issuecomment-1259032970 @caican00 Yes, it was also clear before that when the collection size is greater than 500, there will be no significant performance improvement. In fact, according to the test

[GitHub] [spark] dongjoon-hyun commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

2022-09-27 Thread GitBox
dongjoon-hyun commented on PR #38001: URL: https://github.com/apache/spark/pull/38001#issuecomment-1259077129 Thank you, @cloud-fan , @viirya , @huaxingao . Yes, as Wenchen shared, this is really Spark-specific syntax now. Let me add that to the PR description. ``` hive> SELECT version();

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38013: [SPARK-40509][SS][PYTHON] add example for applyInPandasWithState

2022-09-27 Thread GitBox
HeartSaVioR commented on code in PR #38013: URL: https://github.com/apache/spark/pull/38013#discussion_r980856329 ## examples/src/main/python/sql/streaming/structured_network_wordcount_session_window.py: ## @@ -0,0 +1,114 @@ +# +# Licensed to the Apache Software Foundation

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable

2022-09-27 Thread GitBox
zhengruifeng commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980923342 ## core/src/main/scala/org/apache/spark/internal/config/Connect.scala: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

[GitHub] [spark] HeartSaVioR commented on pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random python worke

2022-09-27 Thread GitBox
HeartSaVioR commented on PR #38008: URL: https://github.com/apache/spark/pull/38008#issuecomment-1259178940 Thanks! Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] HeartSaVioR commented on pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random python worke

2022-09-27 Thread GitBox
HeartSaVioR commented on PR #38008: URL: https://github.com/apache/spark/pull/38008#issuecomment-1259178740 https://github.com/HeartSaVioR/spark/runs/8566461025 Looks like the GA build for checking the result couldn't pull the result from the forked repo. Maybe due to concurrent runs?

[GitHub] [spark] itholic commented on a diff in pull request #38015: [SPARK-40577][PS] Fix `CategoricalIndex.append` to match pandas 1.5.0

2022-09-27 Thread GitBox
itholic commented on code in PR #38015: URL: https://github.com/apache/spark/pull/38015#discussion_r980953075 ## python/pyspark/pandas/indexes/base.py: ## @@ -1907,6 +1908,9 @@ def append(self, other: "Index") -> "Index": ) index_fields =

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r980969164 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala: ## @@ -535,6 +548,98 @@ class DatasetUnpivotSuite extends QueryTest "val"),

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r980968668 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala: ## @@ -535,6 +548,98 @@ class DatasetUnpivotSuite extends QueryTest "val"),

[GitHub] [spark] caican00 commented on pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][DSTREAM][R] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-27 Thread GitBox
caican00 commented on PR #37876: URL: https://github.com/apache/spark/pull/37876#issuecomment-1259021953 I tested it using a real job and the bottleneck still seems to be in `MapBuilder.$plus$eq`. And I have manually used a `for` loop for testing but still saw no significant improvement.

[GitHub] [spark] LuciferYang commented on pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][DSTREAM][R] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-27 Thread GitBox
LuciferYang commented on PR #37876: URL: https://github.com/apache/spark/pull/37876#issuecomment-1259043105 > > the collection size is greater than 500 > > `the collection size is greater than 500`, is it the number of elements in a collection? Yes -- This is an automated

[GitHub] [spark] LuciferYang commented on pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][DSTREAM][R] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-27 Thread GitBox
LuciferYang commented on PR #37876: URL: https://github.com/apache/spark/pull/37876#issuecomment-1259063043 > @caican00 I'm not sure whether it would be better to change to use `toJavaMap` or `toJavaMap.asScala` here. Can you help test it? Hmm... Could you try this one? --

[GitHub] [spark] HeartSaVioR commented on pull request #38013: [SPARK-40509][SS][PYTHON] add example for applyInPandasWithState

2022-09-27 Thread GitBox
HeartSaVioR commented on PR #38013: URL: https://github.com/apache/spark/pull/38013#issuecomment-1259090417 Thanks for the contribution @chaoqin-li1123 ! Looks like python linter is complaining - could you please look into this?

[GitHub] [spark] zhengruifeng commented on pull request #38010: [MINOR] Clarify that xxhash64 seed is 42

2022-09-27 Thread GitBox
zhengruifeng commented on PR #38010: URL: https://github.com/apache/spark/pull/38010#issuecomment-1259152469 Merged into master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific
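A small usage sketch related to the clarification above: `xxhash64` is exposed as a DataFrame/SQL function, and the 42 seed is an internal default rather than a user-facing parameter.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, xxhash64

spark = SparkSession.builder.getOrCreate()
spark.range(1).select(xxhash64(lit("Spark"))).show(truncate=False)
```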

[GitHub] [spark] zhengruifeng closed pull request #38010: [MINOR] Clarify that xxhash64 seed is 42

2022-09-27 Thread GitBox
zhengruifeng closed pull request #38010: [MINOR] Clarify that xxhash64 seed is 42 URL: https://github.com/apache/spark/pull/38010 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] huleilei commented on a diff in pull request #38007: [SPARK-40566][SQL] Add showIndex function

2022-09-27 Thread GitBox
huleilei commented on code in PR #38007: URL: https://github.com/apache/spark/pull/38007#discussion_r980939477 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowIndexExec.scala: ## @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r980960360 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2118,6 +2127,16 @@ class Dataset[T] private[sql]( valueColumnName: String): DataFrame =

[GitHub] [spark] LuciferYang commented on pull request #37999: [SPARK-39146][CORE][SQL][K8S] Introduce `JacksonUtils` to use singleton Jackson ObjectMapper

2022-09-27 Thread GitBox
LuciferYang commented on PR #37999: URL: https://github.com/apache/spark/pull/37999#issuecomment-1259031253 @srowen From the above test results, there is no significant performance difference between using global and local singletons. From a code perspective, thread safety should not

[GitHub] [spark] cloud-fan commented on a diff in pull request #37825: [SPARK-40382][SQL] Group distinct aggregate expressions by semantically equivalent children in `RewriteDistinctAggregates`

2022-09-27 Thread GitBox
cloud-fan commented on code in PR #37825: URL: https://github.com/apache/spark/pull/37825#discussion_r980860055 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala: ## @@ -402,7 +405,28 @@ object RewriteDistinctAggregates

[GitHub] [spark] HyukjinKwon commented on pull request #38014: [SPARK-40575][DOCS] Add badges for PySpark downloads

2022-09-27 Thread GitBox
HyukjinKwon commented on PR #38014: URL: https://github.com/apache/spark/pull/38014#issuecomment-1259131023 I don't feel strongly about this one, actually. We have a bunch of stats, e.g., Maven stats too. cc @srowen FYI -- This is an automated message from the Apache Git Service. To respond to

[GitHub] [spark] viirya commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`

2022-09-27 Thread GitBox
viirya commented on PR #38001: URL: https://github.com/apache/spark/pull/38001#issuecomment-1259140729 Thank you @dongjoon-hyun. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] itholic commented on a diff in pull request #38018: [SPARK-40580][PS][DOCS] Update the document for `DataFrame.to_orc`.

2022-09-27 Thread GitBox
itholic commented on code in PR #38018: URL: https://github.com/apache/spark/pull/38018#discussion_r980915746 ## python/pyspark/pandas/frame.py: ## @@ -5266,12 +5266,12 @@ def to_orc( **options: "OptionalPrimitiveType", ) -> None: """ -Write the

[GitHub] [spark] zhengruifeng commented on pull request #38007: [SPARK-40566][SQL] Add showIndex function

2022-09-27 Thread GitBox
zhengruifeng commented on PR #38007: URL: https://github.com/apache/spark/pull/38007#issuecomment-1259154257 @huleilei mind completing the PR description? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r980941315 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +869,50 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] zhengruifeng commented on a diff in pull request #37761: [SPARK-40311][SQL][PYTHON] Add withColumnsRenamed to scala and pyspark API

2022-09-27 Thread GitBox
zhengruifeng commented on code in PR #37761: URL: https://github.com/apache/spark/pull/37761#discussion_r980958358 ## python/pyspark/sql/dataframe.py: ## @@ -4430,6 +4430,50 @@ def withColumnRenamed(self, existing: str, new: str) -> "DataFrame": """ return
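A hedged sketch of the API under review, as proposed in the PR: a bulk variant of `withColumnRenamed` that takes a mapping of existing-to-new column names.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])
df.withColumnsRenamed({"age": "age2", "name": "name2"}).printSchema()  # proposed API
```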

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r980958517 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -1098,6 +1106,87 @@ class AstBuilder extends

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r980965661 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala: ## @@ -137,48 +138,49 @@ class DatasetUnpivotSuite extends QueryTest

[GitHub] [spark] zhengruifeng commented on pull request #37995: [SPARK-40556][PS][SQL][WIP] Unpersist the intermediate datasets cached in `AttachDistributedSequenceExec`

2022-09-27 Thread GitBox
zhengruifeng commented on PR #37995: URL: https://github.com/apache/spark/pull/37995#issuecomment-1259197076 cc @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r981173804 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala: ## @@ -535,6 +548,98 @@ class DatasetUnpivotSuite extends QueryTest "val"),

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-27 Thread GitBox
HyukjinKwon commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r981253712 ## assembly/pom.xml: ## @@ -74,6 +74,11 @@ spark-repl_${scala.binary.version} ${project.version} + + org.apache.spark +

[GitHub] [spark] HyukjinKwon commented on pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-27 Thread GitBox
HyukjinKwon commented on PR #37710: URL: https://github.com/apache/spark/pull/37710#issuecomment-1259519275 There's an outstanding comment: https://github.com/apache/spark/pull/37710#discussion_r978291019. I am working on this. -- This is an automated message from the Apache Git

[GitHub] [spark] bjornjorgensen commented on pull request #38018: [SPARK-40580][PS][DOCS] Update the document for `DataFrame.to_orc`.

2022-09-27 Thread GitBox
bjornjorgensen commented on PR #38018: URL: https://github.com/apache/spark/pull/38018#issuecomment-1259301993 This is not the same: `pandas API on Spark` or `pandas-on-Spark`, which one do we use? -- This is an automated message from the Apache Git Service. To respond to

[GitHub] [spark] pan3793 commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-27 Thread GitBox
pan3793 commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r981107986 ## assembly/pom.xml: ## @@ -74,6 +74,11 @@ spark-repl_${scala.binary.version} ${project.version} + + org.apache.spark +

[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r981171196 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala: ## @@ -535,6 +548,98 @@ class DatasetUnpivotSuite extends QueryTest "val"),

[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r981213982 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +873,55 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] EnricoMi opened a new pull request, #38020: [SPARK-39877][FOLLOW-UP] PySpark DataFrame.unpivot allows for column names only

2022-09-27 Thread GitBox
EnricoMi opened a new pull request, #38020: URL: https://github.com/apache/spark/pull/38020 ### What changes were proposed in this pull request? As discussed in https://github.com/apache/spark/pull/37407#discussion_r977818035, method `pyspark.sql.DataFrame.unpivot` should support only
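A hedged sketch of what the follow-up proposes for PySpark: `DataFrame.unpivot` accepting plain column-name strings for the id and value columns (the data below is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10, 20), (2, 30, 40)], ["id", "q1", "q2"])
df.unpivot("id", ["q1", "q2"], "quarter", "sales").show()
```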

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-27 Thread GitBox
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r981213360 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -1098,6 +1106,87 @@ class AstBuilder extends

[GitHub] [spark] LuciferYang commented on pull request #37999: [SPARK-39146][CORE][SQL][K8S] Introduce `JacksonUtils` to use singleton Jackson ObjectMapper

2022-09-27 Thread GitBox
LuciferYang commented on PR #37999: URL: https://github.com/apache/spark/pull/37999#issuecomment-1259478461 https://github.com/FasterXML/jackson-core/issues/349#issuecomment-280794659

[GitHub] [spark] LuciferYang commented on pull request #37999: [SPARK-39146][CORE][SQL][K8S] Introduce `JacksonUtils` to use singleton Jackson ObjectMapper

2022-09-27 Thread GitBox
LuciferYang commented on PR #37999: URL: https://github.com/apache/spark/pull/37999#issuecomment-1259522776 > I wonder if we can reuse ObjectMapper inside classes where it matters for perf and not try to share one instance so widely. According to this principle, it is enough to keep

[GitHub] [spark] LuciferYang commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-27 Thread GitBox
LuciferYang commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r981102013 ## assembly/pom.xml: ## @@ -74,6 +74,11 @@ spark-repl_${scala.binary.version} ${project.version} + + org.apache.spark +

[GitHub] [spark] LuciferYang commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-27 Thread GitBox
LuciferYang commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r981102013 ## assembly/pom.xml: ## @@ -74,6 +74,11 @@ spark-repl_${scala.binary.version} ${project.version} + + org.apache.spark +

[GitHub] [spark] LuciferYang commented on a diff in pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-27 Thread GitBox
LuciferYang commented on code in PR #37710: URL: https://github.com/apache/spark/pull/37710#discussion_r981102013 ## assembly/pom.xml: ## @@ -74,6 +74,11 @@ spark-repl_${scala.binary.version} ${project.version} + + org.apache.spark +
