[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36509: [SPARK-38961][PYTHON][DOCS] Enhance to automatically generate the pandas API support list

2022-05-16 Thread GitBox
HyukjinKwon commented on code in PR #36509: URL: https://github.com/apache/spark/pull/36509#discussion_r874263680 ## python/pyspark/pandas/supported_api_gen.py: ## @@ -0,0 +1,374 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license

[GitHub] [spark] dcoliversun commented on pull request #36567: [SPARK-39196][CORE][SQL][K8S] replace `getOrElse(null)` with `orNull`

2022-05-16 Thread GitBox
dcoliversun commented on PR #36567: URL: https://github.com/apache/spark/pull/36567#issuecomment-1128289511 @srowen Hi. I checked it again. All `getOrElse(null)`s are replaced with `orNull`. Let's see the result of GA. -- This is an automated message from the Apache Git Service. To respond

[GitHub] [spark] attilapiros commented on a diff in pull request #36529: [SPARK-39102][CORE][SQL][DSTREAM] Add checkstyle rules to disable use of Guava's `Files.createTempDir()`

2022-05-16 Thread GitBox
attilapiros commented on code in PR #36529: URL: https://github.com/apache/spark/pull/36529#discussion_r874324907 ## common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java: ## @@ -362,6 +364,18 @@ public static byte[] bufferToArray(ByteBuffer buffer) {

[GitHub] [spark] HyukjinKwon commented on pull request #36575: [SPARK-39206][INFRA][PS] Add PS label for Pandas API on Spark in PRs

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36575: URL: https://github.com/apache/spark/pull/36575#issuecomment-1128389654 cc @zhengruifeng @itholic @xinrong-databricks @ueshin @Yikun

[GitHub] [spark] HyukjinKwon commented on pull request #36574: [SPARK-39205][PYTHON][PS][INFRA] Add `PANDAS API ON SPARK` label

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36574: URL: https://github.com/apache/spark/pull/36574#issuecomment-1128389821 Oh! you were faster than me!

[GitHub] [spark] gengliangwang commented on pull request #36570: [SPARK-39096][SQL][FOLLOW-UP] Fix "MERGE INTO TABLE" test to pass with ANSI mode on

2022-05-16 Thread GitBox
gengliangwang commented on PR #36570: URL: https://github.com/apache/spark/pull/36570#issuecomment-1128416447 @HyukjinKwon I think you meant to ping @dtenedor

[GitHub] [spark] sadikovi commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-16 Thread GitBox
sadikovi commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874261650 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala: ## @@ -30,29 +30,16 @@ import

[GitHub] [spark] LuciferYang commented on a diff in pull request #36571: [SPARK-39202][SQL] Introduce a `putByteArrays` method to `WritableColumnVector` to support setting multiple duplicate `byte[]`

2022-05-16 Thread GitBox
LuciferYang commented on code in PR #36571: URL: https://github.com/apache/spark/pull/36571#discussion_r874324355 ## sql/core/src/test/scala/org/apache/spark/sql/execution/ColumnVectorUtilsBenchmark.scala: ## @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] LuciferYang commented on a diff in pull request #36529: [SPARK-39102][CORE][SQL][DSTREAM] Add checkstyle rules to disable use of Guava's `Files.createTempDir()`

2022-05-16 Thread GitBox
LuciferYang commented on code in PR #36529: URL: https://github.com/apache/spark/pull/36529#discussion_r874332519 ## common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java: ## @@ -362,6 +364,18 @@ public static byte[] bufferToArray(ByteBuffer buffer) {

[GitHub] [spark] HyukjinKwon commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128374030 I am surprised that the same Python library returns a different value depending on the OS. I know some libraries depend on the C library implementation but didn't expect such drastic

[GitHub] [spark] viirya commented on pull request #36572: [SPARK-36718][SQL][FOLLOWUP] Improve the extract-only check in CollapseProject

2022-05-16 Thread GitBox
viirya commented on PR #36572: URL: https://github.com/apache/spark/pull/36572#issuecomment-1128384199 lgtm

[GitHub] [spark] LuciferYang commented on pull request #36573: [SPARK-38829][SQL][FOLLOWUP] Add `PARQUET_TIMESTAMP_NTZ_ENABLED` configuration for `ParquetWrite.prepareWrite`

2022-05-16 Thread GitBox
LuciferYang commented on PR #36573: URL: https://github.com/apache/spark/pull/36573#issuecomment-1128394346 > Can you add a test for this? What is "Pass GA"? I tried refactoring `ParquetIOSuite` to test V1 and V2 like `Write TimestampNTZ type`, but data is still written by the V1 API now due to

[GitHub] [spark] sadikovi commented on pull request #36573: [SPARK-38829][SQL][FOLLOWUP] Add `PARQUET_TIMESTAMP_NTZ_ENABLED` configuration for `ParquetWrite.prepareWrite`

2022-05-16 Thread GitBox
sadikovi commented on PR #36573: URL: https://github.com/apache/spark/pull/36573#issuecomment-1128394222 Can you link the code to highlight the inconsistency between v1 and v2 configuration? Thanks.

[GitHub] [spark] AmplabJenkins commented on pull request #36567: [SPARK-39196][CORE][SQL][K8S] replace `getOrElse(null)` with `orNull`

2022-05-16 Thread GitBox
AmplabJenkins commented on PR #36567: URL: https://github.com/apache/spark/pull/36567#issuecomment-1128394243 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
AmplabJenkins commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128394268 Can one of the admins verify this patch?

[GitHub] [spark] LuciferYang commented on pull request #36573: [SPARK-38829][SQL][FOLLOWUP] Add `PARQUET_TIMESTAMP_NTZ_ENABLED` configuration for `ParquetWrite.prepareWrite`

2022-05-16 Thread GitBox
LuciferYang commented on PR #36573: URL: https://github.com/apache/spark/pull/36573#issuecomment-1128403631 > I think I just missed it to be honest and I did not add a test for it so the bug slipped during the review. I think we should add a test for both v1 and v2 data sources to make

[GitHub] [spark] HyukjinKwon commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128422820 @AnywalkerGiser mind pointing out any documentation that states this negative value?

[GitHub] [spark] AnywalkerGiser commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
AnywalkerGiser commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128422483 Understood, looking forward to your reply later.

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36509: [SPARK-38961][PYTHON][DOCS] Enhance to automatically generate the pandas API support list

2022-05-16 Thread GitBox
HyukjinKwon commented on code in PR #36509: URL: https://github.com/apache/spark/pull/36509#discussion_r874263076 ## python/pyspark/pandas/supported_api_gen.py: ## @@ -0,0 +1,374 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36509: [SPARK-38961][PYTHON][DOCS] Enhance to automatically generate the pandas API support list

2022-05-16 Thread GitBox
HyukjinKwon commented on code in PR #36509: URL: https://github.com/apache/spark/pull/36509#discussion_r874263200 ## python/pyspark/pandas/supported_api_gen.py: ## @@ -0,0 +1,374 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license

[GitHub] [spark] Yikun commented on a diff in pull request #36464: [SPARK-38947][PYTHON] Supports groupby positional indexing

2022-05-16 Thread GitBox
Yikun commented on code in PR #36464: URL: https://github.com/apache/spark/pull/36464#discussion_r873426625 ## python/pyspark/pandas/groupby.py: ## @@ -2110,22 +2110,79 @@ def _limit(self, n: int, asc: bool) -> FrameLike: groupkey_scols =

[GitHub] [spark] LuciferYang opened a new pull request, #36573: [SPARK-38829][SQL][FOLLOWUP] Add `PARQUET_TIMESTAMP_NTZ_ENABLED` configuration for `ParquetWrite.prepareWrite`

2022-05-16 Thread GitBox
LuciferYang opened a new pull request, #36573: URL: https://github.com/apache/spark/pull/36573 ### What changes were proposed in this pull request? SPARK-38829 added the `PARQUET_TIMESTAMP_NTZ_ENABLED` configuration for Parquet V1 read and V1 write; this PR adds it to the Parquet V2 write process to

[GitHub] [spark] AnywalkerGiser commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
AnywalkerGiser commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128391117 The localtime function does not differ across OSes, but the fromtimestamp function does, and many datetime functions have problems resolving dates before 1970 on Windows.
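The portability issue under discussion can be sketched in standalone Python (a minimal illustration, not the PR's actual patch): on Windows, `datetime.datetime.fromtimestamp` raises `OSError` for negative timestamps, while deriving the value from the epoch with a `timedelta` works on every OS.

```python
import datetime


def fromtimestamp_portable(ts):
    # Build the datetime from the epoch instead of calling
    # fromtimestamp, which rejects negative values on Windows.
    epoch = datetime.datetime(1970, 1, 1)
    return epoch + datetime.timedelta(seconds=ts)
```

Note that this sketch yields a naive UTC datetime, whereas `fromtimestamp` applies the local timezone; a real fix would also need to account for that offset.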

[GitHub] [spark] HyukjinKwon commented on pull request #36574: [SPARK-39205][PYTHON][PS][INFRA] Add `PANDAS API ON SPARK` label

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36574: URL: https://github.com/apache/spark/pull/36574#issuecomment-1128391140 Merged to master.

[GitHub] [spark] LuciferYang commented on pull request #36573: [SPARK-38829][SQL][FOLLOWUP] Add `PARQUET_TIMESTAMP_NTZ_ENABLED` configuration for `ParquetWrite.prepareWrite`

2022-05-16 Thread GitBox
LuciferYang commented on PR #36573: URL: https://github.com/apache/spark/pull/36573#issuecomment-1128399257 @sadikovi I found you added this configuration at the following location: `ParquetFileFormat.prepareWrite`

[GitHub] [spark] HyukjinKwon commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128418090 I would like to understand the problem first, and see if this behaviour difference is internal or official. This is a core code path, so we would have to expect more reviews and time.

[GitHub] [spark] github-actions[bot] commented on pull request #35342: [SPARK-38043][SQL] Refactor FileBasedDataSourceSuite and add DataSourceSuite for each data source

2022-05-16 Thread GitBox
github-actions[bot] commented on PR #35342: URL: https://github.com/apache/spark/pull/35342#issuecomment-1128264145 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] github-actions[bot] commented on pull request #34359: [SPARK-36986][SQL] Improving schema filtering flexibility

2022-05-16 Thread GitBox
github-actions[bot] commented on PR #34359: URL: https://github.com/apache/spark/pull/34359#issuecomment-1128264166 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] warrenzhu25 commented on a diff in pull request #35498: [SPARK-34777][UI] StagePage input/output size records not show when r…

2022-05-16 Thread GitBox
warrenzhu25 commented on code in PR #35498: URL: https://github.com/apache/spark/pull/35498#discussion_r874254004 ## core/src/main/resources/org/apache/spark/ui/static/stagepage.js: ## @@ -404,8 +404,8 @@ $(document).ready(function () { var responseBody = response;

[GitHub] [spark] HyukjinKwon commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36499: URL: https://github.com/apache/spark/pull/36499#issuecomment-1128272888 Yeah, I don't quite follow the change. Why do we need to change precision and scale?

[GitHub] [spark] ulysses-you commented on pull request #36530: [SPARK-39172][SQL] Remove left/right outer join if only left/right side columns are selected and the join keys on the other side are uniq

2022-05-16 Thread GitBox
ulysses-you commented on PR #36530: URL: https://github.com/apache/spark/pull/36530#issuecomment-1128293347 thank you @cloud-fan @sigmod for review

[GitHub] [spark] physinet commented on pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-16 Thread GitBox
physinet commented on PR #36545: URL: https://github.com/apache/spark/pull/36545#issuecomment-1128297953 > We should probably add a configuration like `spark.sql.pyspark.legacy.inferFirstElementInArray.enabled` (feel free to pick other names if you have other ideas). Would also have to
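The behavior change this PR proposes can be illustrated with a standalone sketch (hypothetical helper functions, not PySpark's actual inference code): legacy inference derives the element type from the first value only, while the proposed behavior merges the types of all values.

```python
def infer_first(values):
    # Legacy-style: only the first element decides the type.
    return type(values[0])


def infer_all(values):
    # Proposed-style: merge the types of every non-None element,
    # widening mixed int/float to float.
    types = {type(v) for v in values if v is not None}
    if types == {int, float}:
        return float
    return types.pop() if len(types) == 1 else object
```

Under first-element inference a list like `[1, 2.0]` would be typed as an array of integers and lose the fractional values; merging across all elements widens it to float.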

[GitHub] [spark] physinet commented on a diff in pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-16 Thread GitBox
physinet commented on code in PR #36545: URL: https://github.com/apache/spark/pull/36545#discussion_r874275723 ## python/pyspark/sql/tests/test_types.py: ## @@ -285,6 +285,64 @@ def test_infer_nested_dict_as_struct(self): df = self.spark.createDataFrame(data)

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36445: [SPARK-39096][SQL] Support MERGE commands with DEFAULT values

2022-05-16 Thread GitBox
HyukjinKwon commented on code in PR #36445: URL: https://github.com/apache/spark/pull/36445#discussion_r874322941 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala: ## @@ -1335,7 +1346,7 @@ class PlanResolutionSuite extends AnalysisTest

[GitHub] [spark] HyukjinKwon opened a new pull request, #36570: [SPARK-39096][SQL][FOLLOW-UP] Fix "MERGE INTO TABLE" test to pass with ANSI mode on

2022-05-16 Thread GitBox
HyukjinKwon opened a new pull request, #36570: URL: https://github.com/apache/spark/pull/36570 ### What changes were proposed in this pull request? This PR is a minor followup of https://github.com/apache/spark/pull/36445 that fixes the tests to pass when ANSI mode is on. Currently,

[GitHub] [spark] AnywalkerGiser commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
AnywalkerGiser commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128378813 @HyukjinKwon This is explained in the [Python 3 datetime documentation](https://docs.python.org/3/library/datetime.html).

[GitHub] [spark] HyukjinKwon closed pull request #36574: [SPARK-39205][PYTHON][PS][INFRA] Add `PANDAS API ON SPARK` label

2022-05-16 Thread GitBox
HyukjinKwon closed pull request #36574: [SPARK-39205][PYTHON][PS][INFRA] Add `PANDAS API ON SPARK` label URL: https://github.com/apache/spark/pull/36574

[GitHub] [spark] AnywalkerGiser commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
AnywalkerGiser commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128391633 @HyukjinKwon Can this solution be merged into master?

[GitHub] [spark] HyukjinKwon closed pull request #36570: [SPARK-39096][SQL][FOLLOW-UP] Fix "MERGE INTO TABLE" test to pass with ANSI mode on

2022-05-16 Thread GitBox
HyukjinKwon closed pull request #36570: [SPARK-39096][SQL][FOLLOW-UP] Fix "MERGE INTO TABLE" test to pass with ANSI mode on URL: https://github.com/apache/spark/pull/36570

[GitHub] [spark] AnywalkerGiser commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
AnywalkerGiser commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128434671 Here are some blogs on related issues: [Python | mktime overflow error](https://stackoverflow.com/questions/2518706/python-mktime-overflow-error) [Python fromtimestamp

[GitHub] [spark] HyukjinKwon commented on pull request #36570: [SPARK-39096][SQL][FOLLOW-UP] Fix "MERGE INTO TABLE" test to pass with ANSI mode on

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36570: URL: https://github.com/apache/spark/pull/36570#issuecomment-1128434537 Merged to master.

[GitHub] [spark] xinrong-databricks opened a new pull request, #36569: Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`

2022-05-16 Thread GitBox
xinrong-databricks opened a new pull request, #36569: URL: https://github.com/apache/spark/pull/36569 ### What changes were proposed in this pull request? Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`. ### Why are the changes needed?
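For background on the `ignore_index` semantics being implemented (shown here with plain pandas, whose behavior pandas-on-Spark aims to match): `ignore_index=True` relabels the result with a fresh 0..n-1 `RangeIndex` instead of keeping the surviving rows' original labels.

```python
import pandas as pd

# Three rows with custom labels; the middle row is a duplicate.
df = pd.DataFrame({"a": [1, 1, 2]}, index=[10, 11, 12])

kept = df.drop_duplicates()                    # keeps labels 10 and 12
relabeled = df.drop_duplicates(ignore_index=True)  # relabels to 0 and 1
```

The same relabeling applies to `DataFrame.explode(..., ignore_index=True)`, where each exploded element would otherwise repeat its source row's label.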

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36445: [SPARK-39096][SQL] Support MERGE commands with DEFAULT values

2022-05-16 Thread GitBox
HyukjinKwon commented on code in PR #36445: URL: https://github.com/apache/spark/pull/36445#discussion_r874271010 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala: ## @@ -1335,7 +1346,7 @@ class PlanResolutionSuite extends AnalysisTest

[GitHub] [spark] zhengruifeng commented on a diff in pull request #36560: [SPARK-39192][PYTHON] Make pandas-on-spark's kurt consistent with pandas

2022-05-16 Thread GitBox
zhengruifeng commented on code in PR #36560: URL: https://github.com/apache/spark/pull/36560#discussion_r874287345 ## python/pyspark/pandas/tests/test_generic_functions.py: ## @@ -150,8 +150,8 @@ def test_stat_functions(self):

[GitHub] [spark] AnywalkerGiser commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
AnywalkerGiser commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128370730 @HyukjinKwon Do you mean the platform library, added to the code comments?

[GitHub] [spark] HyukjinKwon opened a new pull request, #36575: Add PS label for Pandas API on Spark in PRs

2022-05-16 Thread GitBox
HyukjinKwon opened a new pull request, #36575: URL: https://github.com/apache/spark/pull/36575 ### What changes were proposed in this pull request? This PR proposes to add "PS" label automatically to PRs. ### Why are the changes needed? We added Pandas API on Spark in

[GitHub] [spark] LuciferYang commented on pull request #36573: [SPARK-38829][SQL][FOLLOWUP] Add `PARQUET_TIMESTAMP_NTZ_ENABLED` configuration for `ParquetWrite.prepareWrite`

2022-05-16 Thread GitBox
LuciferYang commented on PR #36573: URL: https://github.com/apache/spark/pull/36573#issuecomment-1128396037 `ParquetFileFormat.prepareWrite` has the following configuration:

[GitHub] [spark] beobest2 commented on pull request #36509: [SPARK-38961][PYTHON][DOCS] Enhance to automatically generate the pandas API support list

2022-05-16 Thread GitBox
beobest2 commented on PR #36509: URL: https://github.com/apache/spark/pull/36509#issuecomment-1128395540 @HyukjinKwon I've fixed this as you reviewed. Please check recent commits : )

[GitHub] [spark] AnywalkerGiser commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
AnywalkerGiser commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128444516 I thought it was a bug in Python, but the documentation says it reports an error when the value is out of the supported time range. I tested Python 3.6, 3.7, and 3.8 on Windows.

[GitHub] [spark] zhengruifeng commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

2022-05-16 Thread GitBox
zhengruifeng commented on PR #36547: URL: https://github.com/apache/spark/pull/36547#issuecomment-1128336669 +1 for what @ueshin @Yikun commented. > The only potential issue I can imagine, we have to use builtins.set when we want to call the built-in func under the same namespace
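The namespace-shadowing concern quoted above can be illustrated with a small self-contained sketch (the class and method names here are hypothetical, not PySpark's actual code): when a class defines a method named after a builtin such as `all` or `set`, its implementation must reach the real builtin through the `builtins` module.

```python
import builtins


class GroupByLike:
    """Hypothetical stand-in for a GroupBy-style class."""

    def all(self, values, skipna=True):
        # 'all' is shadowed inside this class, so the real builtin
        # must be called explicitly via the builtins module.
        if skipna:
            values = [v for v in values if v is not None]
        return builtins.all(values)
```

The same workaround applies to `set`, the case raised in the review: any code in that namespace wanting the built-in constructor would write `builtins.set(...)`.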

[GitHub] [spark] HyukjinKwon commented on pull request #36570: [SPARK-39096][SQL][FOLLOW-UP] Fix "MERGE INTO TABLE" test to pass with ANSI mode on

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36570: URL: https://github.com/apache/spark/pull/36570#issuecomment-1128361630 cc @danielsan @gengliangwang

[GitHub] [spark] LuciferYang opened a new pull request, #36571: [SPARK-39202][SQL] Introduce a `putByteArrays` method to `WritableColumnVector` to support setting multiple duplicate `byte[]`

2022-05-16 Thread GitBox
LuciferYang opened a new pull request, #36571: URL: https://github.com/apache/spark/pull/36571 ### What changes were proposed in this pull request? This PR adds a `putByteArrays` method to `WritableColumnVector` as follows: ```java public int putByteArrays(int rowId, int total,

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
HyukjinKwon commented on code in PR #36566: URL: https://github.com/apache/spark/pull/36566#discussion_r874323369 ## python/pyspark/sql/types.py: ## @@ -212,15 +213,29 @@ def needConversion(self) -> bool: def toInternal(self, dt: datetime.datetime) -> int: if dt

[GitHub] [spark] HyukjinKwon commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128362425 @AnywalkerGiser BTW, it would be easier to follow and review if you could share a link about the OS difference in that Python library.

[GitHub] [spark] AnywalkerGiser commented on a diff in pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
AnywalkerGiser commented on code in PR #36566: URL: https://github.com/apache/spark/pull/36566#discussion_r874328761 ## python/pyspark/sql/types.py: ## @@ -212,15 +213,29 @@ def needConversion(self) -> bool: def toInternal(self, dt: datetime.datetime) -> int: if

[GitHub] [spark] sadikovi commented on pull request #36573: [SPARK-38829][SQL][FOLLOWUP] Add `PARQUET_TIMESTAMP_NTZ_ENABLED` configuration for `ParquetWrite.prepareWrite`

2022-05-16 Thread GitBox
sadikovi commented on PR #36573: URL: https://github.com/apache/spark/pull/36573#issuecomment-1128402353 I think I just forgot to be honest and I did not add a test for it so the bug slipped during the review. I think we should add a test for both v1 and v2 data sources to make sure it

[GitHub] [spark] sadikovi commented on pull request #36573: [SPARK-38829][SQL][FOLLOWUP] Add `PARQUET_TIMESTAMP_NTZ_ENABLED` configuration for `ParquetWrite.prepareWrite`

2022-05-16 Thread GitBox
sadikovi commented on PR #36573: URL: https://github.com/apache/spark/pull/36573#issuecomment-1128408501 Thanks!

[GitHub] [spark] dtenedor commented on a diff in pull request #36445: [SPARK-39096][SQL] Support MERGE commands with DEFAULT values

2022-05-16 Thread GitBox
dtenedor commented on code in PR #36445: URL: https://github.com/apache/spark/pull/36445#discussion_r874274237 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala: ## @@ -1335,7 +1346,7 @@ class PlanResolutionSuite extends AnalysisTest {

[GitHub] [spark] Yikun commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

2022-05-16 Thread GitBox
Yikun commented on PR #36547: URL: https://github.com/apache/spark/pull/36547#issuecomment-1128324461 > But if you look at the code that pandas are using, they [decorate the function](https://github.com/pandas-dev/pandas/blob/v1.4.2/pandas/core/groupby/groupby.py#L1810) with @final

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-16 Thread GitBox
HyukjinKwon commented on code in PR #36545: URL: https://github.com/apache/spark/pull/36545#discussion_r874336480 ## python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst: ## @@ -0,0 +1,23 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more

[GitHub] [spark] HyukjinKwon commented on pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36545: URL: https://github.com/apache/spark/pull/36545#issuecomment-1128377781 cc @BryanCutler @viirya @ueshin FYI

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-16 Thread GitBox
HyukjinKwon commented on code in PR #36545: URL: https://github.com/apache/spark/pull/36545#discussion_r874336689 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -3758,6 +3758,15 @@ object SQLConf { .booleanConf

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36574: [SPARK-39205][PYTHON][PS][INFRA] Add `PANDAS API ON SPARK` label

2022-05-16 Thread GitBox
HyukjinKwon commented on code in PR #36574: URL: https://github.com/apache/spark/pull/36574#discussion_r874348703 ## .github/labeler.yml: ## @@ -130,6 +130,8 @@ STRUCTURED STREAMING: PYTHON: - "bin/pyspark*" - "**/python/**/*" +PANDAS API ON SPARK: Review Comment: Oh

[GitHub] [spark] sadikovi commented on pull request #36573: [SPARK-38829][SQL][FOLLOWUP] Add `PARQUET_TIMESTAMP_NTZ_ENABLED` configuration for `ParquetWrite.prepareWrite`

2022-05-16 Thread GitBox
sadikovi commented on PR #36573: URL: https://github.com/apache/spark/pull/36573#issuecomment-1128390778 Can you add a test for this? What is "Pass GA"?

[GitHub] [spark] HyukjinKwon commented on pull request #36575: [SPARK-39206][INFRA][PS] Add PS label for Pandas API on Spark in PRs

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36575: URL: https://github.com/apache/spark/pull/36575#issuecomment-1128390576 oops. I was late. https://github.com/apache/spark/pull/36574 was open first

[GitHub] [spark] HyukjinKwon closed pull request #36575: [SPARK-39206][INFRA][PS] Add PS label for Pandas API on Spark in PRs

2022-05-16 Thread GitBox
HyukjinKwon closed pull request #36575: [SPARK-39206][INFRA][PS] Add PS label for Pandas API on Spark in PRs URL: https://github.com/apache/spark/pull/36575

[GitHub] [spark] AnywalkerGiser commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
AnywalkerGiser commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128436320 I think the Spark project can look for a solution if Python doesn't fix this bug.

[GitHub] [spark] srowen commented on pull request #36544: [SPARK-39183][BUILD] Upgrade Apache Xerces Java to 2.12.2

2022-05-16 Thread GitBox
srowen commented on PR #36544: URL: https://github.com/apache/spark/pull/36544#issuecomment-1128225167 Merged to master/3.3/3.2

[GitHub] [spark] srowen closed pull request #36544: [SPARK-39183][BUILD] Upgrade Apache Xerces Java to 2.12.2

2022-05-16 Thread GitBox
srowen closed pull request #36544: [SPARK-39183][BUILD] Upgrade Apache Xerces Java to 2.12.2 URL: https://github.com/apache/spark/pull/36544

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36509: [SPARK-38961][PYTHON][DOCS] Enhance to automatically generate the pandas API support list

2022-05-16 Thread GitBox
HyukjinKwon commented on code in PR #36509: URL: https://github.com/apache/spark/pull/36509#discussion_r874262559 ## python/pyspark/pandas/supported_api_gen.py: ## @@ -0,0 +1,374 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license
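The `supported_api_gen.py` file under review generates the pandas API support list automatically. As a rough, hedged sketch of the general idea (diffing the public members of a reference module against its counterpart), where `missing_functions` is an illustrative helper and not a function from the actual file:

```python
import inspect

def missing_functions(reference_module, target_module):
    # Public callables present in the reference module but absent from
    # the target module -- the kind of diff a support list is built from.
    ref = {name for name, member in inspect.getmembers(reference_module, callable)
           if not name.startswith("_")}
    tgt = {name for name, member in inspect.getmembers(target_module, callable)
           if not name.startswith("_")}
    return sorted(ref - tgt)

# Demo with two stdlib modules standing in for pandas / pyspark.pandas
import math, json
print(missing_functions(math, json))  # includes "sqrt", "floor", ...
```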

[GitHub] [spark] LuciferYang commented on a diff in pull request #36529: [SPARK-39102][CORE][SQL][DSTREAM] Add checkstyle rules to disabled use of Guava's `Files.createTempDir()`

2022-05-16 Thread GitBox
LuciferYang commented on code in PR #36529: URL: https://github.com/apache/spark/pull/36529#discussion_r874326565 ## common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java: ## @@ -362,6 +364,18 @@ public static byte[] bufferToArray(ByteBuffer buffer) {

[GitHub] [spark] HyukjinKwon commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128380392 So does the C localtime function behave differently across operating systems, returning negative values? -- This is an automated message from the Apache Git Service. To respond to the message,

[GitHub] [spark] cloud-fan opened a new pull request, #36572: [SPARK-36718][SQL][FOLLOWUP] Improve the extract-only check in CollapseProject

2022-05-16 Thread GitBox
cloud-fan opened a new pull request, #36572: URL: https://github.com/apache/spark/pull/36572 ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/36510 , to fix a corner case: if the `CreateStruct` is only referenced

[GitHub] [spark] cloud-fan commented on pull request #36572: [SPARK-36718][SQL][FOLLOWUP] Improve the extract-only check in CollapseProject

2022-05-16 Thread GitBox
cloud-fan commented on PR #36572: URL: https://github.com/apache/spark/pull/36572#issuecomment-1128380010 cc @viirya -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

[GitHub] [spark] HyukjinKwon commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128437613 is this a bug in Python? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] HyukjinKwon commented on pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
HyukjinKwon commented on PR #36566: URL: https://github.com/apache/spark/pull/36566#issuecomment-1128437464 Does that happen in all Windows with all Python versions? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36509: [SPARK-38961][PYTHON][DOCS] Enhance to automatically generate the pandas API support list

2022-05-16 Thread GitBox
HyukjinKwon commented on code in PR #36509: URL: https://github.com/apache/spark/pull/36509#discussion_r874264598 ## python/pyspark/pandas/supported_api_gen.py: ## @@ -0,0 +1,374 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license

[GitHub] [spark] zhengruifeng commented on pull request #36560: [SPARK-39192][PYTHON] Make pandas-on-spark's kurt consistent with pandas

2022-05-16 Thread GitBox
zhengruifeng commented on PR #36560: URL: https://github.com/apache/spark/pull/36560#issuecomment-1128288307 oh, the test failed, let me look into it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[GitHub] [spark] Yikun opened a new pull request, #36574: [SPARK-39205][PYTHON][PS][INFRA] Add `PANDAS API ON SPARK` label

2022-05-16 Thread GitBox
Yikun opened a new pull request, #36574: URL: https://github.com/apache/spark/pull/36574 ### What changes were proposed in this pull request? Add `PANDAS API ON SPARK` label ### Why are the changes needed? Add `PANDAS API ON SPARK` label ### Does this PR introduce

[GitHub] [spark] Yikun commented on pull request #36574: [SPARK-39205][PYTHON][PS][INFRA] Add `PANDAS API ON SPARK` label

2022-05-16 Thread GitBox
Yikun commented on PR #36574: URL: https://github.com/apache/spark/pull/36574#issuecomment-1128392917 > Oh! you were faster than me! lol, just for explain clearly in mail list -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

[GitHub] [spark] LuciferYang commented on pull request #36571: [SPARK-39202][SQL] Introduce a `putByteArrays` method to `WritableColumnVector` to support setting multiple duplicate `byte[]`

2022-05-16 Thread GitBox
LuciferYang commented on PR #36571: URL: https://github.com/apache/spark/pull/36571#issuecomment-1128419927 For `ColumnVectorUtils.populate` method: ```scala def testPopulate(valuesPerIteration: Int, length: Int): Unit = { val batchSize = 4096 val

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36566: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
HyukjinKwon commented on code in PR #36566: URL: https://github.com/apache/spark/pull/36566#discussion_r874374279 ## python/pyspark/sql/types.py: ## @@ -212,15 +213,29 @@ def needConversion(self) -> bool: def toInternal(self, dt: datetime.datetime) -> int: if dt

[GitHub] [spark] Yikun commented on a diff in pull request #36464: [SPARK-38947][PYTHON] Supports groupby positional indexing

2022-05-16 Thread GitBox
Yikun commented on code in PR #36464: URL: https://github.com/apache/spark/pull/36464#discussion_r873437306 ## python/pyspark/pandas/groupby.py: ## @@ -2110,22 +2110,60 @@ def _limit(self, n: int, asc: bool) -> FrameLike: groupkey_scols =

[GitHub] [spark] AnywalkerGiser opened a new pull request, #36559: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
AnywalkerGiser opened a new pull request, #36559: URL: https://github.com/apache/spark/pull/36559 ### What changes were proposed in this pull request? Fix problems with pyspark in Windows: 1. Fixed datetime conversion to timestamp before 1970; 2. Fixed datetime

[GitHub] [spark] cloud-fan commented on a diff in pull request #36541: [SPARK-39180][SQL] Simplify the planning of limit and offset

2022-05-16 Thread GitBox
cloud-fan commented on code in PR #36541: URL: https://github.com/apache/spark/pull/36541#discussion_r873386778 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala: ## @@ -82,52 +82,45 @@ abstract class SparkStrategies extends

[GitHub] [spark] beliefer commented on a diff in pull request #36541: [SPARK-39180][SQL] Simplify the planning of limit and offset

2022-05-16 Thread GitBox
beliefer commented on code in PR #36541: URL: https://github.com/apache/spark/pull/36541#discussion_r873403382 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala: ## @@ -814,12 +815,19 @@ abstract class SparkStrategies extends

[GitHub] [spark] beliefer commented on a diff in pull request #36541: [SPARK-39180][SQL] Simplify the planning of limit and offset

2022-05-16 Thread GitBox
beliefer commented on code in PR #36541: URL: https://github.com/apache/spark/pull/36541#discussion_r873410416 ## sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala: ## @@ -278,9 +265,9 @@ case class GlobalLimitAndOffsetExec( } /** - * Take the first limit
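The diff above touches `GlobalLimitAndOffsetExec`. As a hedged, pure-Python analogy of the contract the planner must preserve (OFFSET drops rows first, LIMIT then caps what remains) — this only illustrates the semantics, not Spark's implementation:

```python
def limit_offset(rows, limit=None, offset=0):
    # LIMIT n OFFSET m returns rows [m, m + n) of the input
    remaining = rows[offset:]          # drop the first `offset` rows
    return remaining if limit is None else remaining[:limit]

print(limit_offset(list(range(10)), limit=3, offset=2))  # [2, 3, 4]
```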

[GitHub] [spark] beliefer commented on a diff in pull request #36541: [SPARK-39180][SQL] Simplify the planning of limit and offset

2022-05-16 Thread GitBox
beliefer commented on code in PR #36541: URL: https://github.com/apache/spark/pull/36541#discussion_r873409826 ## sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala: ## @@ -215,61 +211,52 @@ case class LocalLimitExec(limit: Int, child: SparkPlan) extends

[GitHub] [spark] cloud-fan commented on a diff in pull request #36541: [SPARK-39180][SQL] Simplify the planning of limit and offset

2022-05-16 Thread GitBox
cloud-fan commented on code in PR #36541: URL: https://github.com/apache/spark/pull/36541#discussion_r873423506 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -1303,12 +1303,25 @@ case class LocalLimit(limitExpr:

[GitHub] [spark] AnywalkerGiser closed pull request #36559: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-16 Thread GitBox
AnywalkerGiser closed pull request #36559: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows URL: https://github.com/apache/spark/pull/36559 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

[GitHub] [spark] beliefer commented on a diff in pull request #36541: [SPARK-39180][SQL] Simplify the planning of limit and offset

2022-05-16 Thread GitBox
beliefer commented on code in PR #36541: URL: https://github.com/apache/spark/pull/36541#discussion_r873401681 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -1303,12 +1303,25 @@ case class LocalLimit(limitExpr:

[GitHub] [spark] Yikun commented on a diff in pull request #36464: [SPARK-38947][PYTHON] Supports groupby positional indexing

2022-05-16 Thread GitBox
Yikun commented on code in PR #36464: URL: https://github.com/apache/spark/pull/36464#discussion_r873437306 ## python/pyspark/pandas/groupby.py: ## @@ -2110,22 +2110,60 @@ def _limit(self, n: int, asc: bool) -> FrameLike: groupkey_scols =

[GitHub] [spark] zhengruifeng opened a new pull request, #36560: [SPARK-39192][PYTHON] make pandas-on-spark's kurt consistent with pandas

2022-05-16 Thread GitBox
zhengruifeng opened a new pull request, #36560: URL: https://github.com/apache/spark/pull/36560 ### What changes were proposed in this pull request? make pandas-on-spark's kurt consistent with pandas ### Why are the changes needed? 1, the formulas of Kurtosis were different
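On why the two `kurt` results can differ: pandas' `kurt` returns the bias-adjusted sample excess kurtosis (the G2 estimator), while an excess kurtosis computed from population moments (`m4 / m2**2 - 3`) gives a different value on the same data. A hedged sketch of the two definitions (function names are illustrative, and this is not the PR's code):

```python
def kurt_population(xs):
    # Excess kurtosis from population moments: m4 / m2^2 - 3
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3

def kurt_sample_adjusted(xs):
    # Bias-adjusted sample excess kurtosis (G2), the estimator pandas uses
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
    m4 = sum((x - mean) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * m4 / s2 ** 2 \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

print(kurt_population([1, 2, 3, 4, 5]))       # ≈ -1.3
print(kurt_sample_adjusted([1, 2, 3, 4, 5]))  # ≈ -1.2 (matches pandas .kurt())
```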

[GitHub] [spark] gengliangwang commented on pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-16 Thread GitBox
gengliangwang commented on PR #36562: URL: https://github.com/apache/spark/pull/36562#issuecomment-1127384475 cc @sadikovi who reported this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] MaxGekk opened a new pull request, #36558: [SPARK-39187][SQL][3.3] Remove `SparkIllegalStateException`

2022-05-16 Thread GitBox
MaxGekk opened a new pull request, #36558: URL: https://github.com/apache/spark/pull/36558 ### What changes were proposed in this pull request? Remove `SparkIllegalStateException` and replace it by `IllegalStateException` where it was used. This is a backport of

[GitHub] [spark] beliefer commented on a diff in pull request #36541: [SPARK-39180][SQL] Simplify the planning of limit and offset

2022-05-16 Thread GitBox
beliefer commented on code in PR #36541: URL: https://github.com/apache/spark/pull/36541#discussion_r873402602 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala: ## @@ -81,55 +81,56 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan]

[GitHub] [spark] cloud-fan commented on a diff in pull request #36541: [SPARK-39180][SQL] Simplify the planning of limit and offset

2022-05-16 Thread GitBox
cloud-fan commented on code in PR #36541: URL: https://github.com/apache/spark/pull/36541#discussion_r873425084 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -1303,12 +1303,25 @@ case class LocalLimit(limitExpr:

[GitHub] [spark] gengliangwang opened a new pull request, #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-16 Thread GitBox
gengliangwang opened a new pull request, #36562: URL: https://github.com/apache/spark/pull/36562 ### What changes were proposed in this pull request? When reading JSON/CSV files with inferring timestamp types (`.option("inferTimestamp", true)`), the Timestamp conversion will
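The speedup idea behind the PR is to avoid expensive timestamp-conversion work during schema inference. A hedged, stdlib-only analogy of that pattern (not Spark's actual code path): reject values that cannot be timestamps with a cheap shape check before attempting the costly parse.

```python
import re
from datetime import datetime

# Cheap pre-filter: only strings shaped like "YYYY-MM-DD[ T]HH:MM:SS" proceed
TS_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")

def maybe_timestamp(value: str):
    if not TS_SHAPE.match(value):
        return None  # fast reject: skip the expensive parse entirely
    try:
        return datetime.fromisoformat(value[:19].replace("T", " "))
    except ValueError:
        return None  # shape matched but the fields were invalid

print(maybe_timestamp("2022-05-16 10:30:00"))  # 2022-05-16 10:30:00
print(maybe_timestamp("not a timestamp"))      # None
```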

[GitHub] [spark] MaxGekk commented on pull request #36558: [SPARK-39187][SQL][3.3] Remove `SparkIllegalStateException`

2022-05-16 Thread GitBox
MaxGekk commented on PR #36558: URL: https://github.com/apache/spark/pull/36558#issuecomment-1127392276 Merging to 3.3. Thank you, @HyukjinKwon for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above
