[GitHub] [spark] github-actions[bot] commented on pull request #36695: [SPARK-38474][CORE] Use error class in org.apache.spark.security
github-actions[bot] commented on PR #36695: URL: https://github.com/apache/spark/pull/36695#issuecomment-1319387650 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #38680: [SPARK-40657][PROTOBUF][FOLLOWUP][MINOR] Add clarifying comment in ProtobufUtils
HeartSaVioR commented on PR #38680: URL: https://github.com/apache/spark/pull/38680#issuecomment-1319400995 https://github.com/rangadi/spark/actions/runs/3490416069/jobs/5847475694 The second build run is passing for most jobs; the k8s integration job is still pending, but I don't believe this PR will break it. Thanks! Merging to master.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
HyukjinKwon commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025959335

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:

```
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))
```

Review Comment: Ah .. this is tricky.. Let's just probably mark this test as `ignore` for now then.. with adding some comments like some custom memory configurations would have to be applied ...
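The `assume(!sys.env.contains("GITHUB_ACTIONS"))` gating discussed above has a close analogue in Python's unittest; this is a minimal sketch, not the actual suite (the test body is a placeholder for the real >2GB allocation):

```python
import os
import unittest

# GITHUB_ACTIONS is set automatically inside GitHub Actions runners; the
# ScalaTest suite above keys off the same variable via assume().
ON_GITHUB_ACTIONS = "GITHUB_ACTIONS" in os.environ

class LargeResultTest(unittest.TestCase):
    @unittest.skipIf(ON_GITHUB_ACTIONS, "needs more memory than CI runners provide")
    def test_collect_large_partition(self):
        # Stand-in for allocating a >2GB result in the real suite.
        data = bytearray(1024)
        self.assertEqual(len(data), 1024)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(LargeResultTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A skipped test still counts as run and successful, so the suite result is green either way, which is exactly the CI behavior the thread wants.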
[GitHub] [spark] LuciferYang commented on pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollectingS
LuciferYang commented on PR #38704: URL: https://github.com/apache/spark/pull/38704#issuecomment-1319535893 cc @HyukjinKwon
[GitHub] [spark] LuciferYang opened a new pull request, #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollecting
LuciferYang opened a new pull request, #38704: URL: https://github.com/apache/spark/pull/38704

### What changes were proposed in this pull request?
This PR ignores `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollectingSuite` by default, because it cannot run successfully with Spark's default Java options.

### Why are the changes needed?
Avoids local test failures.

### Does this PR introduce _any_ user-facing change?
No, test only.

### How was this patch tested?
- Pass GA
- Manual test: in my test environment, after changing `-Xmx4g` to `-Xmx10g`, both Maven and SBT can run the test successfully in my
[GitHub] [spark] cloud-fan commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
cloud-fan commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319573470 shall we change `FileSourceMetadataAttribute`? I think the metadata column (at least for file source) is always not nullable.
[GitHub] [spark] LuciferYang opened a new pull request, #38705: [SPARK-41173][SQL] Move `require()` out from the constructors of string expressions
LuciferYang opened a new pull request, #38705: URL: https://github.com/apache/spark/pull/38705 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] Yaohua628 commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
Yaohua628 commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319593235 > shall we change `FileSourceMetadataAttribute`? I initially thought we could relax this field for some future cases. But yeah, you are right, it seems like it is always not null for file sources. But do you think it will cause some compatibility issues? If this `nullable` has been persisted somewhere?
[GitHub] [spark] github-actions[bot] commented on pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join
github-actions[bot] commented on PR #37129: URL: https://github.com/apache/spark/pull/37129#issuecomment-1319387624 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #37359: [SPARK-25342][CORE][SQL]Support rolling back a result stage and rerunning all result tasks when writing files
github-actions[bot] commented on PR #37359: URL: https://github.com/apache/spark/pull/37359#issuecomment-1319387601 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] HyukjinKwon closed pull request #38690: [SPARK-41177][PROTOBUF][TESTS] Fix maven test failed of `protobuf` module
HyukjinKwon closed pull request #38690: [SPARK-41177][PROTOBUF][TESTS] Fix maven test failed of `protobuf` module URL: https://github.com/apache/spark/pull/38690
[GitHub] [spark] HyukjinKwon commented on pull request #38690: [SPARK-41177][PROTOBUF][TESTS] Fix maven test failed of `protobuf` module
HyukjinKwon commented on PR #38690: URL: https://github.com/apache/spark/pull/38690#issuecomment-1319471042 Merged to master.
[GitHub] [spark] panbingkun commented on pull request #37725: [DO-NOT-MERGE] Exceptions without error classes in SQL golden files
panbingkun commented on PR #37725: URL: https://github.com/apache/spark/pull/37725#issuecomment-1319476144 OK
[GitHub] [spark] Yikun commented on pull request #38698: [SPARK-41186][PS][TESTS] Replace `list_run_infos` with `search_runs` in mlflow doctest
Yikun commented on PR #38698: URL: https://github.com/apache/spark/pull/38698#issuecomment-1319483376 First testing with mlflow 1.30.0; if the result passes, I will apply the fully refreshed infra (mlflow 2.0.1) in the next commits.
[GitHub] [spark] WeichenXu123 opened a new pull request, #38699: [SPARK-41188][CORE][ML] Set executorEnv OMP_NUM_THREADS to be spark.task.cpus by default for spark executor JVM processes
WeichenXu123 opened a new pull request, #38699: URL: https://github.com/apache/spark/pull/38699 Signed-off-by: Weichen Xu

### What changes were proposed in this pull request?
Set executorEnv OMP_NUM_THREADS to be spark.task.cpus by default for Spark executor JVM processes.

### Why are the changes needed?
This limits the number of threads an OpenBLAS routine uses to the number of cores assigned to the executor, because some Spark ML algorithms call OpenBLAS via netlib-java.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually.
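The defaulting rule the PR describes can be sketched in a few lines; `with_default_omp_threads` is a hypothetical helper, not Spark's actual code, but it captures the intended behavior (set `OMP_NUM_THREADS` only when the user has not):

```python
def with_default_omp_threads(executor_env, task_cpus):
    """Hypothetical helper mirroring the PR's rule: default OMP_NUM_THREADS in
    the executor environment to spark.task.cpus unless the user already set it."""
    env = dict(executor_env)
    env.setdefault("OMP_NUM_THREADS", str(task_cpus))
    return env

# Default applied when unset; an explicit user value is left untouched.
print(with_default_omp_threads({}, 2))                        # {'OMP_NUM_THREADS': '2'}
print(with_default_omp_threads({"OMP_NUM_THREADS": "8"}, 2))  # {'OMP_NUM_THREADS': '8'}
```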
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
HyukjinKwon commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1026007189

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:

```
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))
```

Review Comment: Thanks man
[GitHub] [spark] zhengruifeng opened a new pull request, #38706: [TEST ONLY] Come back to collect.foreach(send)
zhengruifeng opened a new pull request, #38706: URL: https://github.com/apache/spark/pull/38706 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] wangyum commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous
wangyum commented on PR #38676: URL: https://github.com/apache/spark/pull/38676#issuecomment-1319657140 @EnricoMi @cloud-fan Could we fix the `DeduplicateRelations`? It did not generate different expression IDs for all conflicting attributes:
```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.DeduplicateRelations ===
 Join LeftSemi                              Join LeftSemi
 :- Project [(id#4 + 1) AS id#6]            :- Project [(id#4 + 1) AS id#6]
 :  +- Deduplicate [id#4]                   :  +- Deduplicate [id#4]
 :     +- Project [value#1 AS id#4]         :     +- Project [value#1 AS id#4]
 :        +- LocalRelation [value#1]        :        +- LocalRelation [value#1]
 +- Deduplicate [id#4]                      +- Deduplicate [id#4]
!   +- Project [value#1 AS id#4]               +- Project [value#8 AS id#4]
!      +- LocalRelation [value#1]              +- LocalRelation [value#8]
```
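What the rule should guarantee can be modeled with a toy sketch (plain integers standing in for expression IDs; nothing here is Spark's actual implementation): every occurrence of a conflicting ID in a later branch gets the same fresh replacement, which is what the plan diff above fails to do for `id#4`.

```python
from itertools import count

def deduplicate(branches):
    """Toy model of DeduplicateRelations' intent: if an attribute ID already
    appears in an earlier branch, rewrite EVERY occurrence of it in the later
    branch with one fresh ID. Fresh IDs start at 100 for readability only."""
    fresh = count(100)
    seen, result = set(), []
    for attrs in branches:
        mapping, renamed = {}, []
        for attr_id in attrs:
            if attr_id in seen:
                if attr_id not in mapping:
                    mapping[attr_id] = next(fresh)
                renamed.append(mapping[attr_id])
            else:
                renamed.append(attr_id)
        seen.update(renamed)
        result.append(renamed)
    return result

print(deduplicate([[1, 4], [1, 4]]))  # [[1, 4], [100, 101]]
```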
[GitHub] [spark] HyukjinKwon commented on pull request #38611: [SPARK-41107][PYTHON][INFRA][TESTS] Install memory-profiler in the CI
HyukjinKwon commented on PR #38611: URL: https://github.com/apache/spark/pull/38611#issuecomment-1319432329 I think mlflow got upgraded together for some reason.
[GitHub] [spark] HeartSaVioR commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
HeartSaVioR commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319473845 cc. @cloud-fan
[GitHub] [spark] LuciferYang commented on pull request #38690: [SPARK-41177][PROTOBUF][TESTS] Fix maven test failed of `protobuf` module
LuciferYang commented on PR #38690: URL: https://github.com/apache/spark/pull/38690#issuecomment-1319473917 Thanks @HyukjinKwon
[GitHub] [spark] itholic commented on a diff in pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`
itholic commented on code in PR #38644: URL: https://github.com/apache/spark/pull/38644#discussion_r1025941541

## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala:

```
@@ -244,7 +244,7 @@ class CastWithAnsiOnSuite extends CastSuiteBase with QueryErrorsBase {
       Decimal("12345678901234567890123456789012345678"))
     checkExceptionInExpression[ArithmeticException](
       cast("123456789012345678901234567890123456789", DecimalType(38, 0)),
-      "Out of decimal type range")
+      "NUMERIC_OUT_OF_SUPPORTED_RANGE")
```

Review Comment: Let me just test with `checkExceptionInExpression` for now, since it failed in CI when using `checkError`: `checkError` doesn't yet support the various test modes such as "non-codegen mode", "codegen mode", and "unsafe mode". I think we might need a similar test util for expression tests, something like `checkErrorInExpression`.
[GitHub] [spark] Yikun commented on pull request #38698: [SPARK-41186][PS][TESTS] Replace `list_run_infos` with `search_runs` in mlflow doctest
Yikun commented on PR #38698: URL: https://github.com/apache/spark/pull/38698#issuecomment-1319485275 cc @HyukjinKwon @xinrong-meng
[GitHub] [spark] itholic commented on pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`
itholic commented on PR #38644: URL: https://github.com/apache/spark/pull/38644#issuecomment-1319485107 cc @MaxGekk @srielau
[GitHub] [spark] LuciferYang commented on pull request #38675: [SPARK-41161][BUILD] Upgrade scala-parser-combinators to 2.1.1
LuciferYang commented on PR #38675: URL: https://github.com/apache/spark/pull/38675#issuecomment-1319538214 Passes on Scala 2.12 and 2.13.
[GitHub] [spark] sadikovi commented on a diff in pull request #38700: [SPARK-41189][PYTHON] Add an environment to switch on and off namedtuple hack
sadikovi commented on code in PR #38700: URL: https://github.com/apache/spark/pull/38700#discussion_r1025989704

## python/pyspark/serializers.py:

```
@@ -357,7 +358,7 @@ def dumps(self, obj):
         return obj


-if sys.version_info < (3, 8):
+if sys.version_info < (3, 8) or os.environ.get("PYSPARK_MONKEYPATCH_NAMEDTUPLE") == "1":
```

Review Comment: nit: I was thinking about a name like this one: `PYSPARK_ENABLE_NAMEDTUPLE_PATCH`, but I am open to alternatives.

## python/pyspark/serializers.py:

```
@@ -54,6 +54,7 @@
 """

 import sys
+import os
```

Review Comment: nit: should os come before sys?
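The toggle under review reduces to a small predicate; here is a sketch under the assumption that the variable keeps the name from the diff (the review suggests `PYSPARK_ENABLE_NAMEDTUPLE_PATCH` as an alternative, so the name may change):

```python
import sys

def namedtuple_patch_enabled(environ):
    """True when the namedtuple serialization hack should be applied: always on
    Python < 3.8, or when the opt-in environment variable is set to "1".
    The variable name is taken from the diff above and is still under review."""
    return sys.version_info < (3, 8) or environ.get("PYSPARK_MONKEYPATCH_NAMEDTUPLE") == "1"

print(namedtuple_patch_enabled({"PYSPARK_MONKEYPATCH_NAMEDTUPLE": "1"}))  # True
```

Passing the environment mapping as an argument (instead of reading `os.environ` directly) makes the predicate easy to test without mutating process state.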
[GitHub] [spark] toujours33 opened a new pull request, #38709: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
toujours33 opened a new pull request, #38709: URL: https://github.com/apache/spark/pull/38709

### What changes were proposed in this pull request?
ExecutorAllocationManager only keeps a count of speculative tasks: `stageAttemptToNumSpeculativeTasks` is incremented when a speculative task is submitted and decremented only when a speculative task ends. If the original task finishes before its speculative copy starts, the speculative task will never be scheduled, which leaks entries in `stageAttemptToNumSpeculativeTasks` and misleads the calculation of target executors. This PR fixes the issue by adding the task index to the `SparkListenerSpeculativeTaskSubmitted` event and recording speculative tasks by task index, so that when the task finishes, its pending speculative copy is also decremented.

### Why are the changes needed?
To fix idle executors caused by pending speculative tasks for tasks that have already finished.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a comprehensive unit test. Pass the GA
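The bookkeeping change can be illustrated with a toy model (all names here are hypothetical, not the actual ExecutorAllocationManager API): tracking speculative tasks by index lets the original task's completion retire a speculative copy that was never scheduled, where a bare counter would leak.

```python
class SpeculationTracker:
    """Toy model of the fix: record pending speculative tasks by task index
    (not just a count), so a finishing task can also retire its unscheduled
    speculative copy."""

    def __init__(self):
        self.pending_speculative = set()  # task indices awaiting a speculative run

    def on_speculative_task_submitted(self, task_index):
        self.pending_speculative.add(task_index)

    def on_task_end(self, task_index):
        # Covers both cases: the speculative task itself ending, and the
        # original task finishing before its speculative copy was scheduled.
        self.pending_speculative.discard(task_index)

    def target_extra_executors(self, tasks_per_executor=1):
        return len(self.pending_speculative) // tasks_per_executor

tracker = SpeculationTracker()
tracker.on_speculative_task_submitted(7)
tracker.on_task_end(7)  # original finished first; no leaked demand remains
```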
[GitHub] [spark] HyukjinKwon closed pull request #38681: [SPARK-41165][CONNECT] Avoid hangs in the arrow collect code path
HyukjinKwon closed pull request #38681: [SPARK-41165][CONNECT] Avoid hangs in the arrow collect code path URL: https://github.com/apache/spark/pull/38681
[GitHub] [spark] HyukjinKwon closed pull request #38694: [SPARK-41184][CONNECT] Disable flakey Fill.NA tests
HyukjinKwon closed pull request #38694: [SPARK-41184][CONNECT] Disable flakey Fill.NA tests URL: https://github.com/apache/spark/pull/38694
[GitHub] [spark] HyukjinKwon commented on pull request #38681: [SPARK-41165][CONNECT] Avoid hangs in the arrow collect code path
HyukjinKwon commented on PR #38681: URL: https://github.com/apache/spark/pull/38681#issuecomment-1319402787 Merged to master.
[GitHub] [spark] HyukjinKwon commented on pull request #38694: [SPARK-41184][CONNECT] Disable flakey Fill.NA tests
HyukjinKwon commented on PR #38694: URL: https://github.com/apache/spark/pull/38694#issuecomment-1319402627 Merged to master.
[GitHub] [spark] bersprockets opened a new pull request, #38697: [SPARK-41118][SQL][3.3] `to_number`/`try_to_number` should return `null` when format is `null`
bersprockets opened a new pull request, #38697: URL: https://github.com/apache/spark/pull/38697 Backport of #38635

### What changes were proposed in this pull request?
When a user specifies a null format in `to_number`/`try_to_number`, return `null`, with a data type of `DecimalType.USER_DEFAULT`, rather than throwing a `NullPointerException`. Also, since the code for `ToNumber` and `TryToNumber` is virtually identical, put all common code in new abstract class `ToNumberBase` to avoid fixing the bug in two places.

### Why are the changes needed?
`to_number`/`try_to_number` currently throws a `NullPointerException` when the format is `null`:
```
spark-sql> SELECT to_number('454', null);
org.apache.spark.SparkException: The Spark SQL phase analysis failed with an internal error. Please, fill a bug report in, and provide the full stack trace.
  at org.apache.spark.sql.execution.QueryExecution$.toInternalError(QueryExecution.scala:500)
  at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:512)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:185)
  ...
Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormat$lzycompute(numberFormatExpressions.scala:72)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormat(numberFormatExpressions.scala:72)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormatter$lzycompute(numberFormatExpressions.scala:73)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormatter(numberFormatExpressions.scala:73)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.checkInputDataTypes(numberFormatExpressions.scala:81)
```
Also:
```
spark-sql> SELECT try_to_number('454', null);
org.apache.spark.SparkException: The Spark SQL phase analysis failed with an internal error. Please, fill a bug report in, and provide the full stack trace.
  at org.apache.spark.sql.execution.QueryExecution$.toInternalError(QueryExecution.scala:500)
  at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:512)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:185)
  ...
Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormat$lzycompute(numberFormatExpressions.scala:72)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormat(numberFormatExpressions.scala:72)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormatter$lzycompute(numberFormatExpressions.scala:73)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.numberFormatter(numberFormatExpressions.scala:73)
  at org.apache.spark.sql.catalyst.expressions.ToNumber.checkInputDataTypes(numberFormatExpressions.scala:81)
  at org.apache.spark.sql.catalyst.expressions.TryToNumber.checkInputDataTypes(numberFormatExpressions.scala:146)
```
Compare to `to_binary` and `try_to_binary`:
```
spark-sql> SELECT to_binary('abc', null);
NULL
Time taken: 3.111 seconds, Fetched 1 row(s)
spark-sql> SELECT try_to_binary('abc', null);
NULL
Time taken: 0.06 seconds, Fetched 1 row(s)
```
Also compare to `to_number` in PostgreSQL 11.18:
```
SELECT to_number('454', null) is null as a;
a
true
```

### Does this PR introduce _any_ user-facing change?
`to_number`/`try_to_number` with a null format will now return `null` with a data type of `DecimalType.USER_DEFAULT`.

### How was this patch tested?
New unit test.
[GitHub] [spark] Yikun opened a new pull request, #38698: [SPARK-41186][PS][TESTS] Replace `list_run_infos` with `search_runs` in mlflow doctest
Yikun opened a new pull request, #38698: URL: https://github.com/apache/spark/pull/38698

### What changes were proposed in this pull request?
This patch replaces `list_run_infos` with `search_runs` in the mlflow doctest. Since mlflow 1.29.0 (https://github.com/mlflow/mlflow/commit/fb9abf98595c1b32117aac90b0b2170ce120cd9f), `list_run_infos` is deprecated and replaced by `search_runs`.

### Why are the changes needed?
We hit this error after the infra upgrade: https://github.com/apache/spark/pull/38611#issuecomment-1319430302

### Does this PR introduce _any_ user-facing change?
No, test only.

### How was this patch tested?
Test passed with mlflow 1.30.0 and 2.0.1.
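The mlflow change follows a common deprecation pattern: the old method is kept for a while as a thin wrapper that warns and forwards to the new one. A minimal sketch of that pattern, using a hypothetical `Client` class rather than mlflow's real implementation:

```python
import warnings


class Client:
    """Toy client showing how a list_run_infos-style deprecation works."""

    def search_runs(self, experiment_ids):
        # The new, supported entry point takes a list of experiment ids.
        return [f"run-info-for-{exp_id}" for exp_id in experiment_ids]

    def list_run_infos(self, experiment_id):
        # The deprecated method warns, then delegates to search_runs.
        warnings.warn(
            "list_run_infos is deprecated; use search_runs instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.search_runs([experiment_id])
```

Callers that migrate, as the doctest does here, simply switch to `search_runs` and stop triggering the warning.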
[GitHub] [spark] LuciferYang commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
LuciferYang commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025982465

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))

Review Comment: https://github.com/apache/spark/pull/38704 ignores this case by default
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38700: [SPARK-41189][PYTHON] Add an environment to switch on and off namedtuple hack
HyukjinKwon commented on code in PR #38700: URL: https://github.com/apache/spark/pull/38700#discussion_r1026007445

## python/pyspark/serializers.py:
@@ -357,7 +358,7 @@ def dumps(self, obj):
         return obj

-if sys.version_info < (3, 8):
+if sys.version_info < (3, 8) or os.environ.get("PYSPARK_MONKEYPATCH_NAMEDTUPLE") == "1":

Review Comment:
```suggestion
if sys.version_info < (3, 8) or os.environ.get("PYSPARK_ENABLE_NAMEDTUPLE_PATCH") == "1":
```

## python/pyspark/serializers.py:
@@ -471,7 +472,7 @@ def loads(self, obj, encoding="bytes"):
         return cloudpickle.loads(obj, encoding=encoding)

-if sys.version_info < (3, 8):
+if sys.version_info < (3, 8) or os.environ.get("PYSPARK_MONKEYPATCH_NAMEDTUPLE") == "1":

Review Comment:
```suggestion
if sys.version_info < (3, 8) or os.environ.get("PYSPARK_ENABLE_NAMEDTUPLE_PATCH") == "1":
```
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38700: [SPARK-41189][PYTHON] Add an environment to switch on and off namedtuple hack
HyukjinKwon commented on code in PR #38700: URL: https://github.com/apache/spark/pull/38700#discussion_r1026007386

## python/pyspark/serializers.py:
@@ -54,6 +54,7 @@
 """

 import sys
+import os

Review Comment: eh, it's actually fine in Python import (per PEP 8)
[GitHub] [spark] HeartSaVioR closed pull request #38680: [SPARK-40657][PROTOBUF][FOLLOWUP][MINOR] Add clarifying comment in ProtobufUtils
HeartSaVioR closed pull request #38680: [SPARK-40657][PROTOBUF][FOLLOWUP][MINOR] Add clarifying comment in ProtobufUtils URL: https://github.com/apache/spark/pull/38680
[GitHub] [spark] LuciferYang commented on pull request #38609: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect.
LuciferYang commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1319466822

Thanks @HyukjinKwon @grundprinzip @amaliujia ~
[GitHub] [spark] HeartSaVioR commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
HeartSaVioR commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319473783

Maybe it is simpler to apply `KnownNullable` / `KnownNotNull` against `CreateStruct` to enforce the desired nullability? Please refer to the change in #35543.
[GitHub] [spark] LuciferYang commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
LuciferYang commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025954405

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))

Review Comment: @liuzqt Could you tell me how to run the local test successfully with the Spark default Java options?
[GitHub] [spark] panbingkun opened a new pull request, #38707: [SPARK-41176][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1042
panbingkun opened a new pull request, #38707: URL: https://github.com/apache/spark/pull/38707

### What changes were proposed in this pull request?
In the PR, I propose to assign a name to the error class _LEGACY_ERROR_TEMP_1042.

### Why are the changes needed?
Proper names of error classes should improve user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
[GitHub] [spark] LuciferYang commented on pull request #37725: [DO-NOT-MERGE] Exceptions without error classes in SQL golden files
LuciferYang commented on PR #37725: URL: https://github.com/apache/spark/pull/37725#issuecomment-1319616814

- SPARK-41173: https://github.com/apache/spark/pull/38705
[GitHub] [spark] Yikun commented on pull request #38611: [SPARK-41107][PYTHON][INFRA][TESTS] Install memory-profiler in the CI
Yikun commented on PR #38611: URL: https://github.com/apache/spark/pull/38611#issuecomment-1319457396

Yes, the new version of mlflow had some breaking changes; you could first install memory-profiler at the end of the dockerfile (like connect). I will find some time today to fix the doctest for the new mlflow, in a separate PR.
[GitHub] [spark] cloud-fan commented on pull request #38691: [SPARK-41178][SQL] Fix parser rule precedence between JOIN and comma
cloud-fan commented on PR #38691: URL: https://github.com/apache/spark/pull/38691#issuecomment-1319496802

thanks for review, merging to master!
[GitHub] [spark] wineternity opened a new pull request, #38702: SPARK-41187 [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
wineternity opened a new pull request, #38702: URL: https://github.com/apache/spark/pull/38702

### What changes were proposed in this pull request?
Ignore the SparkListenerTaskEnd with reason "Resubmitted" to avoid a memory leak.

### Why are the changes needed?
For a long-running Spark thriftserver, LiveExecutor objects will accumulate in the deadExecutors HashMap and cause the message event queue to be processed slowly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UT added. Tested in a thriftserver env.

### The way to reproduce
I tried to reproduce it in spark-shell, but it is a bit involved:
1. Start spark-shell, set spark.dynamicAllocation.maxExecutors=2 for convenience: `bin/spark-shell --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8006"`
2. Run a job with shuffle: `sc.parallelize(1 to 1000, 10).map { x => Thread.sleep(1000) ; (x % 3, x) }.reduceByKey((a, b) => a + b).collect()`
3. After some ShuffleMapTasks finish, kill one or two executors to let tasks be resubmitted
4. Check by heap dump or debugger
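The leak mechanism can be modeled in a few lines of Python: a "Resubmitted" task-end event duplicates an end event that was already counted when the executor was lost, so processing it would corrupt the executor's active-task count and the cleanup condition for dead executors would never fire. This is a simplified model, not Spark's actual `AppStatusListener`:

```python
class StatusListener:
    """Simplified model of per-executor task bookkeeping."""

    def __init__(self):
        self.active_tasks = {}       # executor_id -> running task count
        self.dead_executors = set()  # removed executors awaiting cleanup

    def on_task_start(self, executor_id):
        self.active_tasks[executor_id] = self.active_tasks.get(executor_id, 0) + 1

    def on_task_end(self, executor_id, reason):
        if reason == "Resubmitted":
            # The fix: this task already produced a real end event when its
            # executor was lost; counting it again would corrupt the count
            # and keep the dead executor's entry alive forever.
            return
        self.active_tasks[executor_id] -= 1
        self._maybe_clean(executor_id)

    def on_executor_removed(self, executor_id):
        self.dead_executors.add(executor_id)
        self._maybe_clean(executor_id)

    def _maybe_clean(self, executor_id):
        # A dead executor is only dropped once all its tasks have ended.
        if executor_id in self.dead_executors and self.active_tasks.get(executor_id) == 0:
            self.dead_executors.discard(executor_id)
            del self.active_tasks[executor_id]
```

With the early return in place, a late "Resubmitted" event is a no-op and the dead executor's state is still released.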
[GitHub] [spark] LuciferYang commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
LuciferYang commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025960041

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))

Review Comment: OK
[GitHub] [spark] itholic commented on pull request #38702: SPARK-41187 [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
itholic commented on PR #38702: URL: https://github.com/apache/spark/pull/38702#issuecomment-1319566712

Can we change the JIRA format in the title to something like "[SPARK-41187][CORE] ..."? Checking the [Spark contribution guide](https://spark.apache.org/contributing.html) would also be helpful!
[GitHub] [spark] LuciferYang opened a new pull request, #38708: [SPARK-41194][PROTOBUF][TESTS] Add `log4j2.properties` for testing to `protobuf` module
LuciferYang opened a new pull request, #38708: URL: https://github.com/apache/spark/pull/38708

### What changes were proposed in this pull request?
This pr adds a `log4j2.properties` file for testing to the `protobuf` module, as the other modules have.

### Why are the changes needed?
The test log should be printed to `target/unit-tests.log` as in other modules.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass Github Actions
- Manual test run
```
build/mvn clean install -DskipTests -pl connector/protobuf -am
build/mvn test -pl connector/protobuf
```
**Before** The log is printed to the console.
**After** The log is printed to `target/unit-tests.log`.
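A minimal test-scope `log4j2.properties` of the kind this PR adds might look like the following. This is a sketch based on the convention of routing test logs to `target/unit-tests.log`; the exact appender settings in Spark's modules may differ:

```properties
# Route all test logging to a file appender writing target/unit-tests.log
rootLogger.level = info
rootLogger.appenderRef.file.ref = File

appender.file.type = File
appender.file.name = File
appender.file.fileName = target/unit-tests.log
appender.file.append = true
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
```

Without such a file on the test classpath, Log4j 2 falls back to its default console configuration, which is why the logs previously went to the console.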
[GitHub] [spark] Yaohua628 commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
Yaohua628 commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319636434

> If it has been persisted before (like a table), then it's totally fine to write non-nullable data to a nullable column. The optimizer may also optimize a column from nullable to non-nullable, so this will happen from time to time.

Got it, that makes sense! Updated
[GitHub] [spark] zhengruifeng opened a new pull request, #38695: [TEST ONLY][DO NOT MERGE]. Test the schema of `collect`
zhengruifeng opened a new pull request, #38695: URL: https://github.com/apache/spark/pull/38695

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] HyukjinKwon commented on pull request #38611: [SPARK-41107][PYTHON][INFRA][TESTS] Install memory-profiler in the CI
HyukjinKwon commented on PR #38611: URL: https://github.com/apache/spark/pull/38611#issuecomment-1319430302

The test failure this time seems different:
```
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 168, in pyspark.pandas.mlflow.load_model
Failed example:
    run_info = client.list_run_infos(exp_id)[-1]
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        run_info = client.list_run_infos(exp_id)[-1]
    AttributeError: 'MlflowClient' object has no attribute 'list_run_infos'
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 169, in pyspark.pandas.mlflow.load_model
Failed example:
    model = load_model("runs:/{run_id}/model".format(run_id=run_info.run_uuid))
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        model = load_model("runs:/{run_id}/model".format(run_id=run_info.run_uuid))
    NameError: name 'run_info' is not defined
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 171, in pyspark.pandas.mlflow.load_model
Failed example:
    prediction_df["prediction"] = model.predict(prediction_df)
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        prediction_df["prediction"] = model.predict(prediction_df)
    NameError: name 'model' is not defined
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in pyspark.pandas.mlflow.load_model
Failed example:
    prediction_df
Expected:
    x1   x2   prediction
    0    2.0  4.0    1.31
Got:
    x1   x2
    0    2.0  4.0
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 178, in pyspark.pandas.mlflow.load_model
Failed example:
    model.predict(prediction_df[["x1", "x2"]].to_pandas())
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        model.predict(prediction_df[["x1", "x2"]].to_pandas())
    NameError: name 'model' is not defined
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 189, in pyspark.pandas.mlflow.load_model
Failed example:
    y = model.predict(features)
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        y = model.predict(features)
    NameError: name 'model' is not defined
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 198, in pyspark.pandas.mlflow.load_model
Failed example:
    features['y'] = y
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        features['y'] = y
    NameError: name 'y' is not defined
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 200, in pyspark.pandas.mlflow.load_model
Failed example:
    everything
Expected:
    x1   x2   z    y
    0    2.0  3.0  -1  1.376932
Got:
    x1   x2   z
    0    2.0  3.0  -1
**
8 of 26 in pyspark.pandas.mlflow.load_model
```
[GitHub] [spark] HyukjinKwon commented on pull request #38698: [SPARK-41186][PS][TESTS] Replace `list_run_infos` with `search_runs` in mlflow doctest
HyukjinKwon commented on PR #38698: URL: https://github.com/apache/spark/pull/38698#issuecomment-1319487583

cc @harupy @WeichenXu123 FYI
[GitHub] [spark] HyukjinKwon opened a new pull request, #38700: [SPARK-41189][PYTHON] Add an environment to switch on and off namedtuple hack
HyukjinKwon opened a new pull request, #38700: URL: https://github.com/apache/spark/pull/38700

### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/34688 that adds a switch to turn on and off the namedtuple hack.

### Why are the changes needed?
There are still behaviour differences between regular pickle and Cloudpickle, e.g., bug fixes from the upstream. It's safer to have a switch to turn it on and off for the time being.

### Does this PR introduce _any_ user-facing change?
This remains an internal environment variable, so ideally no. In fact, the main change itself was an internal change too.

### How was this patch tested?
Manually tested.
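The gating condition this PR introduces can be factored into a tiny predicate, sketched below in plain Python. The helper name is made up for illustration, and the environment variable name matches the one suggested in review on this thread, though the name that finally lands in Spark may differ:

```python
import os
import sys


def should_patch_namedtuple(version_info=None, env=None):
    """Decide whether to apply the namedtuple serialization hack.

    Patch on Python < 3.8, or when the escape-hatch variable is set,
    so users can re-enable the old behaviour on newer Pythons.
    """
    version_info = sys.version_info if version_info is None else version_info
    env = os.environ if env is None else env
    return version_info < (3, 8) or env.get("PYSPARK_ENABLE_NAMEDTUPLE_PATCH") == "1"
```

Making the check a pure function of its inputs also makes the switch easy to unit-test without touching the real process environment.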
[GitHub] [spark] LuciferYang commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
LuciferYang commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025955547

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))

Review Comment: @HyukjinKwon @mridulm @jiangxb1987 @sadikovi @Ngone51 I failed to run this case locally with the Spark default Java options. Could you run it successfully? Or should we ignore this case by default?
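The `assume(!sys.env.contains("GITHUB_ACTIONS"))` guard being discussed has a direct analogue in Python's unittest. A sketch, where the test body is a placeholder rather than the real 2 GB collect:

```python
import os
import unittest


def in_github_actions(env=None):
    # GitHub Actions sets this variable in every runner.
    return "GITHUB_ACTIONS" in (os.environ if env is None else env)


class LargeResultSuite(unittest.TestCase):
    def test_collect_over_2gb(self):
        if in_github_actions():
            # Mirrors ScalaTest's assume(): mark as skipped, not failed.
            self.skipTest("needs a large heap; verify in a local build")
        # ... the real large-collect assertions would go here ...
```

Like `assume`, `skipTest` reports the case as skipped rather than passed or failed, so CI results stay honest about what was actually exercised.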
[GitHub] [spark] LuciferYang commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
LuciferYang commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025953966

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))

Review Comment: hmm... I tested this suite with **Java 8/11/17** on Linux / macOS on Apple Silicon with the following commands:

- Maven:
```
build/mvn clean install -DskipTests -pl sql/core -am
build/mvn clean test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.DatasetLargeResultCollectingSuite
```
and
```
dev/change-scala-version.sh 2.13
build/mvn clean install -DskipTests -pl sql/core -am -Pscala-2.13
build/mvn clean test -pl sql/core -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.sql.DatasetLargeResultCollectingSuite
```
- SBT:
```
build/sbt clean "sql/testOnly org.apache.spark.sql.DatasetLargeResultCollectingSuite"
```
```
dev/change-scala-version.sh 2.13
build/sbt clean "sql/testOnly org.apache.spark.sql.DatasetLargeResultCollectingSuite" -Pscala-2.13
```

All tests failed with `java.lang.OutOfMemoryError: Java heap space` as follows:
```
10:19:56.910 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.spark.serializer.SerializerHelper$.$anonfun$serializeToChunkedBuffer$1(SerializerHelper.scala:40)
	at org.apache.spark.serializer.SerializerHelper$.$anonfun$serializeToChunkedBuffer$1$adapted(SerializerHelper.scala:40)
	at org.apache.spark.serializer.SerializerHelper$$$Lambda$2321/1995130077.apply(Unknown Source)
	at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
	at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
	at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
	at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
	at org.apache.spark.util.Utils$.$anonfun$writeByteBuffer$1(Utils.scala:271)
	at org.apache.spark.util.Utils$.$anonfun$writeByteBuffer$1$adapted(Utils.scala:271)
	at org.apache.spark.util.Utils$$$Lambda$2324/69671223.apply(Unknown Source)
	at org.apache.spark.util.Utils$.writeByteBufferImpl(Utils.scala:249)
	at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:271)
	at org.apache.spark.util.io.ChunkedByteBuffer.$anonfun$writeExternal$2(ChunkedByteBuffer.scala:103)
	at org.apache.spark.util.io.ChunkedByteBuffer.$anonfun$writeExternal$2$adapted(ChunkedByteBuffer.scala:103)
	at org.apache.spark.util.io.ChunkedByteBuffer$$Lambda$2323/1073743200.apply(Unknown Source)
	at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328)
	at org.apache.spark.util.io.ChunkedByteBuffer.writeExternal(ChunkedByteBuffer.scala:103)
	at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
	at org.apache.spark.serializer.SerializerHelper$.serializeToChunkedBuffer(SerializerHelper.scala:42)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:599)
```
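The `ChunkedByteBuffer` frames in the stack trace illustrate the core idea of the PR: serialize into a list of bounded chunks instead of one contiguous array, since a single JVM `byte[]` cannot exceed roughly 2 GB. A toy Python model of that write path, with the chunk size shrunk for demonstration and no claim to match Spark's actual implementation:

```python
class ChunkedBuffer:
    """Accumulate bytes in fixed-size chunks rather than one big array."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def write(self, data):
        for byte in data:
            if len(self.chunks[-1]) == self.chunk_size:
                # Current chunk is full: allocate a fresh one instead of
                # growing a single ever-larger array.
                self.chunks.append(bytearray())
            self.chunks[-1].append(byte)

    def to_bytes(self):
        # Only for small demos; the point of chunking is to avoid ever
        # materializing the data as one contiguous buffer.
        return b"".join(bytes(c) for c in self.chunks)
```

Each chunk stays under the fixed bound, so total capacity is limited by available memory rather than by the maximum size of one allocation.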
[GitHub] [spark] dongjoon-hyun commented on pull request #38262: [SPARK-40801][BUILD] Upgrade `Apache commons-text` to 1.10
dongjoon-hyun commented on PR #38262: URL: https://github.com/apache/spark/pull/38262#issuecomment-1319552102

The feature release branches like branch-3.3 will, generally, be maintained with bug fix releases for a period of 18 months. We usually have 3 bug fix releases. Since 3.3.1 was released on Oct 25, 3.3.2 will be February or March 2023. Apache Spark 3.4.0 preparation is going to happen in the same timeframe, so the v3.3.2 schedule might be adjusted accordingly.

https://spark.apache.org/versioning-policy.html

![Screenshot 2022-11-17 at 9 03 13 PM](https://user-images.githubusercontent.com/9700541/202622566-33d8a868-3f98-46a3-9c70-f1b328ff29b8.png)
[GitHub] [spark] HyukjinKwon commented on pull request #38609: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect.
HyukjinKwon commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1319422162 Merged to master.
[GitHub] [spark] HyukjinKwon closed pull request #38609: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect.
HyukjinKwon closed pull request #38609: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect. URL: https://github.com/apache/spark/pull/38609
[GitHub] [spark] panbingkun opened a new pull request, #38696: [SPARK-41175][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1078
panbingkun opened a new pull request, #38696: URL: https://github.com/apache/spark/pull/38696

### What changes were proposed in this pull request?
In the PR, I propose to assign a name to the error class _LEGACY_ERROR_TEMP_1078.

### Why are the changes needed?
Proper names of error classes should improve the user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the affected test suites:
> $ build/sbt "catalyst/testOnly *SessionCatalogSuite"
[GitHub] [spark] LuciferYang commented on pull request #38690: [SPARK-41177][PROTOBUF][TESTS] Fix maven test failed of `protobuf` module
LuciferYang commented on PR #38690: URL: https://github.com/apache/spark/pull/38690#issuecomment-1319467631 cc @HyukjinKwon
[GitHub] [spark] itholic commented on a diff in pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`
itholic commented on code in PR #38644: URL: https://github.com/apache/spark/pull/38644#discussion_r1025941541

## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala:

@@ -244,7 +244,7 @@ class CastWithAnsiOnSuite extends CastSuiteBase with QueryErrorsBase {
       Decimal("12345678901234567890123456789012345678"))
     checkExceptionInExpression[ArithmeticException](
       cast("123456789012345678901234567890123456789", DecimalType(38, 0)),
-      "Out of decimal type range")
+      "NUMERIC_OUT_OF_SUPPORTED_RANGE")

Review Comment: Let me just test with `checkExceptionInExpression` for now, since it failed in CI when using `checkError`: `checkError` doesn't yet support the various test modes such as "non-codegen mode", "codegen mode", and "unsafe mode" (so it failed in CI: https://github.com/apache/spark/pull/38644#discussion_r1021424319). I think we might need a similar test util for expression tests, something like `checkErrorInExpression`.
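itholic's point — that `checkError` lacks the multi-mode coverage of `checkExceptionInExpression` — boils down to running the same error-class assertion once per evaluation mode. A minimal Python sketch of that idea (the mode names come from the comment above; the helper and the exception type are hypothetical, not Spark's actual test utilities):

```python
MODES = ["non-codegen", "codegen", "unsafe"]

class SparkError(Exception):
    """Toy stand-in for an error-class-carrying exception (hypothetical)."""
    def __init__(self, error_class):
        super().__init__(error_class)
        self.error_class = error_class

def check_error_in_expression(evaluate, expected_class):
    # Run the same assertion once per evaluation mode, mirroring how
    # checkExceptionInExpression exercises every mode.
    for mode in MODES:
        try:
            evaluate(mode)
        except SparkError as e:
            assert e.error_class == expected_class, f"wrong class in {mode} mode"
        else:
            raise AssertionError(f"no error raised in {mode} mode")

# Toy expression that overflows the same way in every mode.
def cast_overflow(mode):
    raise SparkError("NUMERIC_OUT_OF_SUPPORTED_RANGE")

check_error_in_expression(cast_overflow, "NUMERIC_OUT_OF_SUPPORTED_RANGE")
```

The real utility would additionally have to wire each mode into the expression evaluator; the sketch only shows the loop-over-modes shape.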
[GitHub] [spark] LuciferYang commented on a diff in pull request #38064: [SPARK-40622][SQL][CORE]Remove the limitation that single task result must fit in 2GB
LuciferYang commented on code in PR #38064: URL: https://github.com/apache/spark/pull/38064#discussion_r1025953966

## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:

@@ -2228,6 +2231,31 @@ class DatasetSuite extends QueryTest
   }
 }

+class DatasetLargeResultCollectingSuite extends QueryTest
+  with SharedSparkSession {
+
+  override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
+
+  test("collect data with single partition larger than 2GB bytes array limit") {
+    // This test requires large memory and leads to OOM in Github Action so we skip it. Developer
+    // should verify it in local build.
+    assume(!sys.env.contains("GITHUB_ACTIONS"))

Review Comment: hmm... I tested this suite with Java 8/11/17 on Linux/macOS on Apple Silicon with the following commands:

- Maven:
```
build/mvn clean install -DskipTests -pl sql/core -am
build/mvn clean test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.DatasetLargeResultCollectingSuite
```
and
```
dev/change-scala-version.sh 2.13
build/mvn clean install -DskipTests -pl sql/core -am -Pscala-2.13
build/mvn clean test -pl sql/core -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.sql.DatasetLargeResultCollectingSuite
```
- SBT:
```
build/sbt clean "sql/testOnly org.apache.spark.sql.DatasetLargeResultCollectingSuite"
```
```
dev/change-scala-version.sh 2.13
build/sbt clean "sql/testOnly org.apache.spark.sql.DatasetLargeResultCollectingSuite" -Pscala-2.13
```

All tests failed with `java.lang.OutOfMemoryError: Java heap space` as follows:

```
10:19:56.910 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.spark.serializer.SerializerHelper$.$anonfun$serializeToChunkedBuffer$1(SerializerHelper.scala:40)
	at org.apache.spark.serializer.SerializerHelper$.$anonfun$serializeToChunkedBuffer$1$adapted(SerializerHelper.scala:40)
	at org.apache.spark.serializer.SerializerHelper$$$Lambda$2321/1995130077.apply(Unknown Source)
	at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
	at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
	at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
	at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
	at org.apache.spark.util.Utils$.$anonfun$writeByteBuffer$1(Utils.scala:271)
	at org.apache.spark.util.Utils$.$anonfun$writeByteBuffer$1$adapted(Utils.scala:271)
	at org.apache.spark.util.Utils$$$Lambda$2324/69671223.apply(Unknown Source)
	at org.apache.spark.util.Utils$.writeByteBufferImpl(Utils.scala:249)
	at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:271)
	at org.apache.spark.util.io.ChunkedByteBuffer.$anonfun$writeExternal$2(ChunkedByteBuffer.scala:103)
	at org.apache.spark.util.io.ChunkedByteBuffer.$anonfun$writeExternal$2$adapted(ChunkedByteBuffer.scala:103)
	at org.apache.spark.util.io.ChunkedByteBuffer$$Lambda$2323/1073743200.apply(Unknown Source)
	at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328)
	at org.apache.spark.util.io.ChunkedByteBuffer.writeExternal(ChunkedByteBuffer.scala:103)
	at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
	at org.apache.spark.serializer.SerializerHelper$.serializeToChunkedBuffer(SerializerHelper.scala:42)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:599)
```
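The stack trace above goes through `ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded`, which is the mechanism that lets a serialized task result exceed the JVM's single-array cap (a Java byte array index is an `int`, so one array tops out near 2 GB): output is spread across fixed-size chunks instead of one giant buffer. A rough Python sketch of that chunking idea (illustrative only, not Spark's implementation):

```python
class ChunkedBufferOutputStream:
    """Minimal sketch of chunked buffering: data is appended to a list of
    fixed-size bytearray chunks, so no single allocation has to hold the
    whole serialized result."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = []
        self.pos = 0  # write position inside the last chunk

    def _allocate_new_chunk_if_needed(self):
        # Mirrors the allocateNewChunkIfNeeded step in the stack trace above.
        if not self.chunks or self.pos == self.chunk_size:
            self.chunks.append(bytearray(self.chunk_size))
            self.pos = 0

    def write(self, data: bytes):
        offset = 0
        while offset < len(data):
            self._allocate_new_chunk_if_needed()
            n = min(self.chunk_size - self.pos, len(data) - offset)
            self.chunks[-1][self.pos:self.pos + n] = data[offset:offset + n]
            self.pos += n
            offset += n

    def size(self):
        return (len(self.chunks) - 1) * self.chunk_size + self.pos if self.chunks else 0

out = ChunkedBufferOutputStream(chunk_size=4)
out.write(b"0123456789")
assert len(out.chunks) == 3 and out.size() == 10
```

The OOM LuciferYang hits is then not about the chunking itself but about the total heap needed to hold all chunks at once, which is why the suite demands a large local heap.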
[GitHub] [spark] zhengruifeng opened a new pull request, #38701: [TEST ONLY][DO NOT MERGE] Test collect after avoiding hang with arrow-collect
zhengruifeng opened a new pull request, #38701: URL: https://github.com/apache/spark/pull/38701

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] mcdull-zhang opened a new pull request, #38703: [SPARK-41191] [SQL] Cache Table is not working while nested caches exist
mcdull-zhang opened a new pull request, #38703: URL: https://github.com/apache/spark/pull/38703

### What changes were proposed in this pull request?
For example, consider the following statements:
```sql
cache table t1 as select a from testData3 group by a;
cache table t2 as select a,b from testData2 where a in (select a from t1);
select key,value,b from testData t3 join t2 on t3.key=t2.a;
```
Before this PR, the cached t2 is not used in the third statement:
```tex
Project [key#13, value#14, b#24]
+- SortMergeJoin [key#13], [a#23], Inner
   :- BroadcastHashJoin [key#13], [a#359], LeftSemi, BuildRight, false
   :  :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
   :  :  +- Scan[obj#12]
   :  +- Scan In-memory table t1 [a#359]
   :     +- InMemoryRelation [a#359], StorageLevel(disk, memory, deserialized, 1 replicas)
   :        +- *(2) HashAggregate(keys=[a#33], functions=[], output=[a#33])
   :           +- Exchange hashpartitioning(a#33, 5), ENSURE_REQUIREMENTS, [plan_id=92]
   :              +- *(1) HashAggregate(keys=[a#33], functions=[], output=[a#33])
   :                 +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#33]
   :                    +- Scan[obj#32]
   +- BroadcastHashJoin [a#23], [a#359], LeftSemi, BuildRight, false
      :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
      :  +- Scan[obj#22]
      +- Scan In-memory table t1 [a#359]
         +- InMemoryRelation [a#359], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(2) HashAggregate(keys=[a#33], functions=[], output=[a#33])
               +- Exchange hashpartitioning(a#33, 5), ENSURE_REQUIREMENTS, [plan_id=92]
                  +- *(1) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                     +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#33]
                        +- Scan[obj#32]
```
After this PR:
```tex
Project [key#13, value#14, b#358]
+- BroadcastHashJoin [key#13], [a#357], Inner, BuildRight, false
   :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
   :  +- Scan[obj#12]
   +- Scan In-memory table t2 [a#357, b#358]
      +- InMemoryRelation [a#357, b#358], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) BroadcastHashJoin [a#23], [a#261], LeftSemi, BuildRight, false
            :- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
            :  +- Scan[obj#22]
            +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=155]
               +- Scan In-memory table t1 [a#261]
                  +- InMemoryRelation [a#261], StorageLevel(disk, memory, deserialized, 1 replicas)
                     +- *(2) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                        +- Exchange hashpartitioning(a#33, 5), ENSURE_REQUIREMENTS, [plan_id=92]
                           +- *(1) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                              +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#33]
                                 +- Scan[obj#32]
```

### Why are the changes needed?
Performance improvement.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added a test.
[GitHub] [spark] Yaohua628 commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
Yaohua628 commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319529815

> Maybe simpler to apply KnownNullable / KnownNotNull against CreateStruct to enforce desired nullability? Please refer the change in https://github.com/apache/spark/pull/35543.

Wow, good point, thanks!
[GitHub] [spark] HyukjinKwon closed pull request #38638: [SPARK-41122][CONNECT] Explain API can support different modes
HyukjinKwon closed pull request #38638: [SPARK-41122][CONNECT] Explain API can support different modes URL: https://github.com/apache/spark/pull/38638
[GitHub] [spark] HyukjinKwon commented on pull request #38638: [SPARK-41122][CONNECT] Explain API can support different modes
HyukjinKwon commented on PR #38638: URL: https://github.com/apache/spark/pull/38638#issuecomment-1319539398 Merged to master.
[GitHub] [spark] MaxGekk commented on a diff in pull request #38576: [SPARK-41062][SQL] Rename `UNSUPPORTED_CORRELATED_REFERENCE` to `CORRELATED_REFERENCE`
MaxGekk commented on code in PR #38576: URL: https://github.com/apache/spark/pull/38576#discussion_r1026031578

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

@@ -1059,10 +1060,16 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
   def failOnInvalidOuterReference(p: LogicalPlan): Unit = {
     p.expressions.foreach(checkMixedReferencesInsideAggregateExpr)
     if (!canHostOuter(p) && p.expressions.exists(containsOuter)) {
+      val exprs = new ListBuffer[String]()
+      for (expr <- p.expressions) {

Review Comment: This can be simpler using `filter` instead of `exists` and `for`.

@@ -1059,10 +1060,16 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
   def failOnInvalidOuterReference(p: LogicalPlan): Unit = {
     p.expressions.foreach(checkMixedReferencesInsideAggregateExpr)
     if (!canHostOuter(p) && p.expressions.exists(containsOuter)) {
+      val exprs = new ListBuffer[String]()
+      for (expr <- p.expressions) {
+        if (containsOuter(expr)) {
+          exprs += toPrettySQL(expr)

Review Comment: Use `toSQLExpr`.
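MaxGekk's first suggestion replaces an existence check plus a mutable-buffer loop with a single filter-and-map pass. The same refactoring shape, sketched in Python (the predicate and formatter are hypothetical stand-ins for `containsOuter` and `toSQLExpr`):

```python
def contains_outer(expr):
    # stand-in predicate for containsOuter (hypothetical)
    return expr.startswith("outer(")

def to_sql_expr(expr):
    # stand-in formatter for toSQLExpr (hypothetical)
    return f'"{expr}"'

expressions = ["outer(t1.a)", "t2.b", "outer(t1.c)"]

# Before: `exists(...)` guard, then a for loop appending into a ListBuffer.
# After: one filter/map pass; the non-empty result doubles as the guard.
outer_exprs = [to_sql_expr(e) for e in expressions if contains_outer(e)]
if outer_exprs:
    message_parameters = {"sqlExprs": ",".join(outer_exprs)}

assert message_parameters == {"sqlExprs": '"outer(t1.a)","outer(t1.c)"'}
```

Besides being shorter, the single pass avoids walking `p.expressions` twice (once for `exists`, once for the loop).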
[GitHub] [spark] xinrong-meng commented on pull request #38611: [SPARK-41107][PYTHON][INFRA][TESTS] Install memory-profiler in the CI
xinrong-meng commented on PR #38611: URL: https://github.com/apache/spark/pull/38611#issuecomment-1319423102 Do you happen to know if it's normal to fail so many times? @Yikun
[GitHub] [spark] cloud-fan closed pull request #38691: [SPARK-41178][SQL] Fix parser rule precedence between JOIN and comma
cloud-fan closed pull request #38691: [SPARK-41178][SQL] Fix parser rule precedence between JOIN and comma URL: https://github.com/apache/spark/pull/38691
[GitHub] [spark] dongjoon-hyun commented on pull request #38262: [SPARK-40801][BUILD] Upgrade `Apache commons-text` to 1.10
dongjoon-hyun commented on PR #38262: URL: https://github.com/apache/spark/pull/38262#issuecomment-1319545728 Apache Spark has a pre-defined release cadence, @vitas and @bjornjorgensen. - https://spark.apache.org/versioning-policy.html ![Screenshot 2022-11-17 at 8 56 29 PM](https://user-images.githubusercontent.com/9700541/202620165-e2645b89-dc9f-4529-86f0-eba6d74f42f8.png)
[GitHub] [spark] cloud-fan commented on pull request #38683: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
cloud-fan commented on PR #38683: URL: https://github.com/apache/spark/pull/38683#issuecomment-1319609369 If it has been persisted before (like a table), then it's totally fine to write non-nullable data to a nullable column. The optimizer may also optimize a column from nullable to non-nullable, so this will happen from time to time.
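The rule cloud-fan describes can be stated simply: a write is legal whenever the target column is at least as permissive as the data's nullability, so non-nullable data flowing into a nullable column is always safe. A tiny sketch of that compatibility check (hypothetical helper, not Spark's actual code):

```python
def write_compatible(data_nullable: bool, column_nullable: bool) -> bool:
    # The only illegal case: the data may contain nulls but the
    # target column forbids them. Everything else widens safely.
    return column_nullable or not data_nullable

assert write_compatible(False, True)      # non-nullable data -> nullable column: ok
assert write_compatible(False, False)     # exact match: ok
assert write_compatible(True, True)       # exact match: ok
assert not write_compatible(True, False)  # nullable data -> non-nullable column: no
```

This is also why an optimizer pass that tightens a column from nullable to non-nullable never breaks an existing nullable sink.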
[GitHub] [spark] panbingkun opened a new pull request, #38710: [SPARK-41179][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1092
panbingkun opened a new pull request, #38710: URL: https://github.com/apache/spark/pull/38710

### What changes were proposed in this pull request?
In the PR, I propose to assign a name to the error class _LEGACY_ERROR_TEMP_1092.

### Why are the changes needed?
Proper names of error classes should improve user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
[GitHub] [spark] dengziming commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation
dengziming commented on code in PR #38659: URL: https://github.com/apache/spark/pull/38659#discussion_r1024865486

## connector/connect/src/main/protobuf/spark/connect/relations.proto:

@@ -213,7 +213,7 @@ message Deduplicate {
 message LocalRelation {
   repeated Expression.QualifiedAttribute attributes = 1;

Review Comment: I find we lack a `fromBatchWithSchemaIterator` method corresponding to `toBatchWithSchemaIterator`, so I will implement one.
[GitHub] [spark] ulysses-you commented on pull request #38687: [SPARK-41154][SQL] Incorrect relation caching for queries with time travel spec
ulysses-you commented on PR #38687: URL: https://github.com/apache/spark/pull/38687#issuecomment-1318482662 cc @cloud-fan @huaxingao
[GitHub] [spark] itholic commented on a diff in pull request #38576: [SPARK-41062][SQL] Rename `UNSUPPORTED_CORRELATED_REFERENCE` to `CORRELATED_REFERENCE`
itholic commented on code in PR #38576: URL: https://github.com/apache/spark/pull/38576#discussion_r1025069390

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

@@ -1059,8 +1059,8 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
     if (!canHostOuter(p) && p.expressions.exists(containsOuter)) {
       p.failAnalysis(
         errorClass =
-          "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE",
-        messageParameters = Map("treeNode" -> planToString(p)))
+          "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.CORRELATED_REFERENCE",
+        messageParameters = Map("sqlExprs" -> p.expressions.map(_.sql).mkString(",")))

Review Comment: Oh, I got it. Let me address it. Thanks!!
[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous
EnricoMi commented on PR #38676: URL: https://github.com/apache/spark/pull/38676#issuecomment-1318255174 @wangyum @cloud-fan I would appreciate your suggestions on how to test this bug in `LeftSemiAntiJoinPushDownSuite` (see https://github.com/apache/spark/pull/38676#issuecomment-1317220559).
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38638: [SPARK-41122][CONNECT] Explain API can support different modes
HyukjinKwon commented on code in PR #38638: URL: https://github.com/apache/spark/pull/38638#discussion_r1025039998

## python/pyspark/sql/connect/dataframe.py:

@@ -667,12 +668,70 @@ def schema(self) -> StructType:
         else:
             return self._schema

-    def explain(self) -> str:
+    def explain(
+        self, extended: Optional[Union[bool, str]] = None, mode: Optional[str] = None

Review Comment: Actually NVM. Let's just go ahead as is for now.
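For context on what the new `explain(extended, mode)` signature has to do, the two arguments ultimately collapse into a single explain-mode string. A rough sketch of that normalization (approximate and hypothetical; it only mirrors classic PySpark's behavior loosely, and the real logic lives in `DataFrame.explain`):

```python
def resolve_explain_mode(extended=None, mode=None):
    # Hedged sketch: normalize (extended, mode) into one mode string.
    if extended is not None and mode is not None:
        raise ValueError("extended and mode should not be set together")
    if mode is not None:
        return mode                    # explain(mode="formatted") style call
    if isinstance(extended, str):
        return extended                # explain("codegen") style call
    return "extended" if extended else "simple"

assert resolve_explain_mode() == "simple"
assert resolve_explain_mode(extended=True) == "extended"
assert resolve_explain_mode(mode="formatted") == "formatted"
```

Keeping the resolution in one helper lets the Connect client send a single mode value over the wire regardless of which calling convention the user picked.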
[GitHub] [spark] HyukjinKwon closed pull request #38678: [SPARK-41164][CONNECT] Update relations.proto to follow Connect proto development guide
HyukjinKwon closed pull request #38678: [SPARK-41164][CONNECT] Update relations.proto to follow Connect proto development guide URL: https://github.com/apache/spark/pull/38678
[GitHub] [spark] HyukjinKwon commented on pull request #38678: [SPARK-41164][CONNECT] Update relations.proto to follow Connect proto development guide
HyukjinKwon commented on PR #38678: URL: https://github.com/apache/spark/pull/38678#issuecomment-1318461338 Merged to master.
[GitHub] [spark] LuciferYang commented on pull request #38609: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect.
LuciferYang commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1318498330 Both the PR title and description have been updated, @grundprinzip. Thanks!
[GitHub] [spark] dengziming commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation
dengziming commented on code in PR #38659: URL: https://github.com/apache/spark/pull/38659#discussion_r1024864116

## connector/connect/src/main/protobuf/spark/connect/relations.proto:

@@ -213,7 +213,7 @@ message Deduplicate {
 message LocalRelation {
   repeated Expression.QualifiedAttribute attributes = 1;
-  // TODO: support local data.
+  repeated bytes data = 2;

Review Comment: Thank you. I use `repeated bytes` in case the batch size is larger than maxRecordsPerBatch. I think it is enough to use `bytes` here since `LocalRelation` is mostly used in debugging cases.
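dengziming's design — one `bytes` entry per batch, split at `maxRecordsPerBatch` — can be sketched without Arrow at all. In this toy version `pickle` stands in for the actual batch serialization, and both directions are shown, the decode side playing the role of the `fromBatchWithSchemaIterator` counterpart mentioned above (all names here are illustrative, not Spark's):

```python
import pickle

MAX_RECORDS_PER_BATCH = 3  # stand-in for the batch-size limit

def to_batches(rows):
    # Slice the local data into batches and serialize each one
    # separately, producing the "repeated bytes" payload the
    # proto field would carry.
    return [
        pickle.dumps(rows[i:i + MAX_RECORDS_PER_BATCH])
        for i in range(0, len(rows), MAX_RECORDS_PER_BATCH)
    ]

def from_batches(payloads):
    # Decode each bytes entry and concatenate the rows back together.
    rows = []
    for p in payloads:
        rows.extend(pickle.loads(p))
    return rows

data = [(i, f"row-{i}") for i in range(7)]
payloads = to_batches(data)
assert len(payloads) == 3          # 3 + 3 + 1 rows
assert from_batches(payloads) == data
```

The round trip makes the `repeated` choice concrete: a local relation larger than one batch simply occupies several entries of the field, and the reader concatenates them in order.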
[GitHub] [spark] grundprinzip commented on pull request #38609: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect.
grundprinzip commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1318500167 Thank you!
[GitHub] [spark] panbingkun opened a new pull request, #38688: [WIP][SPARK-41166][TESTS] Check errorSubClass of DataTypeMismatch in *ExpressionSuites
panbingkun opened a new pull request, #38688: URL: https://github.com/apache/spark/pull/38688

### What changes were proposed in this pull request?
This PR aims to check the errorSubClass of DataTypeMismatch in *ExpressionSuites.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
[GitHub] [spark] grundprinzip commented on pull request #38609: [SPARK-40593][BUILD][CONNECT] Make user can build and test `connect` module by specifying the user-defined `protoc` and `protoc-gen-grpc
grundprinzip commented on PR #38609: URL: https://github.com/apache/spark/pull/38609#issuecomment-1318486206

Since I cannot directly edit the PR description or title, I would kindly ask you to make the following changes:

Title: [SPARK-40593][BUILD][CONNECT] Support user configurable `protoc` and `protoc-gen-grpc-java` executables when building Spark Connect.

> ### What changes were proposed in this pull request?
> This PR adds a new profile named `-Puser-defined-protoc` so that users can build and test the `connect` module by specifying custom `protoc` and `protoc-gen-grpc-java` executables.
>
> ### Why are the changes needed?
> As described in [SPARK-40593](https://issues.apache.org/jira/browse/SPARK-40593), the latest versions of `protoc` and `protoc-gen-grpc-java` have minimum version requirements for basic libraries such as `glibc` and `glibcxx`. Because of that, it is not possible to compile the `connect` module out of the box on CentOS 6 or CentOS 7. Instead, the following error message is shown:
>
> ```
> [ERROR] PROTOC FAILED: /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
> /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.18' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
> /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
> /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.5' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
> ```
>
> ### Does this PR introduce _any_ user-facing change?
> No; using the official pre-release `protoc` and `protoc-gen-grpc-java` binary files remains the default.
>
> ### How was this patch tested?
> * Pass GitHub Actions
> * Manual test on CentOS6u3 and CentOS7u4
>
> ```shell
> export CONNECT_PROTOC_EXEC_PATH=/path-to-protoc-exe
> export CONNECT_PLUGIN_EXEC_PATH=/path-to-protoc-gen-grpc-java-exe
> ./build/mvn clean install -pl connector/connect -Puser-defined-protoc -am -DskipTests
> ./build/mvn clean test -pl connector/connect -Puser-defined-protoc
> ```
>
> and
>
> ```shell
> export CONNECT_PROTOC_EXEC_PATH=/path-to-protoc-exe
> export CONNECT_PLUGIN_EXEC_PATH=/path-to-protoc-gen-grpc-java-exe
> ./build/sbt clean "connect/compile" -Puser-defined-protoc
> ./build/sbt "connect/test" -Puser-defined-protoc
> ```
[GitHub] [spark] pan3793 commented on pull request #38651: [SPARK-41136][K8S] Shorten graceful shutdown time of ExecutorPodsSnapshotsStoreImpl to prevent blocking shutdown process
pan3793 commented on PR #38651: URL: https://github.com/apache/spark/pull/38651#issuecomment-1318542102

@dongjoon-hyun @LuciferYang would you please take a look again?
[GitHub] [spark] beliefer opened a new pull request, #38689: [WIP][SPARK-41171][SQL] Push down filter through window when partitionSpec is empty
beliefer opened a new pull request, #38689: URL: https://github.com/apache/spark/pull/38689

### What changes were proposed in this pull request?
Sometimes a SQL query has a filter whose condition compares a rank-like window function with a number. For example:
`SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn FROM Tab1 WHERE rn <= 5`
We can create a `Limit(5)` and push it down as the child of the `Window`:
`SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn FROM (SELECT * FROM Tab1 ORDER BY a LIMIT 5) t`

### Why are the changes needed?
Improve performance.

### Does this PR introduce _any_ user-facing change?
'No'. Just updates the inner implementation.

### How was this patch tested?
New tests.
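The equivalence this rewrite relies on can be illustrated with plain Scala collections; this is only a sketch of the intuition, not the actual Catalyst rule:

```scala
// A rank-like window with an empty partition spec numbers rows in one global
// sort order, so `WHERE rn <= 5` keeps exactly the first five rows of that
// order. Numbering then filtering equals limiting then numbering.
val rows = Seq(7, 3, 9, 1, 5, 8, 2)

val filterAfterWindow = rows.sorted.zipWithIndex
  .map { case (v, i) => (v, i + 1) }   // ROW_NUMBER() OVER (ORDER BY a)
  .filter { case (_, rn) => rn <= 5 }  // the original filter on rn

val limitPushedDown = rows.sorted.take(5) // the new Limit(5) below the Window
  .zipWithIndex.map { case (v, i) => (v, i + 1) }
// Both produce (1,1), (2,2), (3,3), (5,4), (7,5).
```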
[GitHub] [spark] zhengruifeng opened a new pull request, #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
zhengruifeng opened a new pull request, #38686: URL: https://github.com/apache/spark/pull/38686

### What changes were proposed in this pull request?
Implement `DataFrame.drop` with a proto message.

### Why are the changes needed?
For API coverage.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Added UT.
[GitHub] [spark] ulysses-you opened a new pull request, #38687: [SPARK-41154][SQL] Incorrect relation caching for queries with time travel spec
ulysses-you opened a new pull request, #38687: URL: https://github.com/apache/spark/pull/38687

### What changes were proposed in this pull request?
Add TimeTravelSpec to the key of the relation cache in AnalysisContext.

### Why are the changes needed?
Correct the relation resolution for the same table but different TimeTravelSpec.

### Does this PR introduce _any_ user-facing change?
Yes, bug fix.

### How was this patch tested?
Added test.
[GitHub] [spark] mridulm commented on pull request #38560: [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service
mridulm commented on PR #38560: URL: https://github.com/apache/spark/pull/38560#issuecomment-1318238822

I will try to get to this later this week; do let me know if you are still working on it, though.
[GitHub] [spark] mridulm commented on pull request #37922: [SPARK-40480][SHUFFLE] Remove push-based shuffle data after query finished
mridulm commented on PR #37922: URL: https://github.com/apache/spark/pull/37922#issuecomment-1318239964

I will try to get to this later this week; do let me know if you are still working on it or have pending comments to address. Thanks!
[GitHub] [spark] cloud-fan commented on a diff in pull request #38684: [SPARK-41017][SQL][FOLLOWUP] Respect the original Filter operator order
cloud-fan commented on code in PR #38684: URL: https://github.com/apache/spark/pull/38684#discussion_r1024884862

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala:

```
@@ -371,7 +371,8 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper {
     }
     val finalFilters = normalizedFilters.map(projectionFunc)
-    val withFilter = finalFilters.foldRight[LogicalPlan](scanRelation)((cond, plan) => {
+    // bottom-most filters are put in the left of the list.
+    val withFilter = finalFilters.foldLeft[LogicalPlan](scanRelation)((cond, plan) => {
```

Review Comment:
```suggestion
    val withFilter = finalFilters.foldLeft[LogicalPlan](scanRelation)((plan, cond) => {
```
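The suggested parameter swap matters because `foldLeft`'s combining function takes the accumulator first and the next element second — the reverse of `foldRight`. A minimal sketch with hypothetical stand-in plan nodes (not Spark's Catalyst classes):

```scala
// Simplified stand-ins for logical plan nodes, for illustration only.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(cond: String, child: Plan) extends Plan

// Bottom-most filters come first in the list, as the review thread notes.
val finalFilters = Seq("a > 1", "b < 2")

// foldLeft passes (accumulatedPlan, nextCond), so the plan built so far must
// be the first lambda parameter:
val withFilter =
  finalFilters.foldLeft[Plan](Scan("t"))((plan, cond) => Filter(cond, plan))
// => Filter("b < 2", Filter("a > 1", Scan("t"))): "a > 1" ends up bottom-most.
```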
[GitHub] [spark] cloud-fan opened a new pull request, #38691: [SPARK-41178][SQL] Fix parser rule precedence between JOIN and comma
cloud-fan opened a new pull request, #38691: URL: https://github.com/apache/spark/pull/38691

### What changes were proposed in this pull request?
This PR fixes a long-standing parser bug in Spark: `JOIN` should take precedence over the comma when combining relations. For example, `FROM t1, t2 JOIN t3` should result in `t1 X (t2 X t3)`. However, it is `(t1 X t2) X t3` today. You can easily verify this behavior in other databases by running the query `SELECT * FROM t1, t2 JOIN t3 ON t1.c and t3.c`. It should fail, as `t1.c` is not available when joining t2 and t3. I tested MySQL, Oracle and PostgreSQL; all of them fail, but Spark can run this query due to the wrong join order. However, the fix has a large impact: it changes the join order, which can lead to worse performance or unexpected analysis errors. To be safe, this PR hides the fix under a new ANSI sub-config.

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No, as the fix is turned off by default.

### How was this patch tested?
New tests.
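The two groupings can be made explicit with a tiny relation tree; the classes below are illustrative stand-ins, not Spark's parser output:

```scala
sealed trait Rel
case class Table(name: String) extends Rel
case class Cross(left: Rel, right: Rel) extends Rel // cartesian product / join

val (t1, t2, t3) = (Table("t1"), Table("t2"), Table("t3"))

// What Spark's parser produces today for `FROM t1, t2 JOIN t3`:
val today = Cross(Cross(t1, t2), t3) // (t1 X t2) X t3
// What it should produce, with JOIN binding tighter than the comma:
val fixed = Cross(t1, Cross(t2, t3)) // t1 X (t2 X t3)
// In `fixed`, t1's columns are out of scope inside the inner join, which is
// why an ON condition referencing t1.c fails in other databases.
```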
[GitHub] [spark] LuciferYang opened a new pull request, #38690: [SPARK-41177][PROTOBUF][TESTS] Fix maven test failed of `protobuf` module
LuciferYang opened a new pull request, #38690: URL: https://github.com/apache/spark/pull/38690

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] LuciferYang commented on a diff in pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
LuciferYang commented on code in PR #38312: URL: https://github.com/apache/spark/pull/38312#discussion_r1025353875

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala:

```
@@ -1040,6 +1040,14 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
     }
   }
 
+  test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
+    val testDataPath = getClass.getResource("/test-data/timestamp-nanos.parquet")
```

Review Comment: Can we reuse https://github.com/apache/spark/blob/f01a8db4bcf09c4975029e85722053ff82f8a355/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala#L462-L467
[GitHub] [spark] cloud-fan commented on pull request #38684: [SPARK-41017][SQL][FOLLOWUP] Respect the original Filter operator order
cloud-fan commented on PR #38684: URL: https://github.com/apache/spark/pull/38684#issuecomment-1318837045

Thanks for the review, merging to master!
[GitHub] [spark] cloud-fan commented on a diff in pull request #38678: [SPARK-41164][CONNECT] Update relations.proto to follow Connect proto development guide
cloud-fan commented on code in PR #38678: URL: https://github.com/apache/spark/pull/38678#discussion_r1025196133

## connector/connect/src/main/protobuf/spark/connect/relations.proto:

```
@@ -106,24 +113,39 @@ message Project {
   //
   // For example, `SELECT ABS(-1)` is valid plan without an input plan.
   Relation input = 1;
+
+  // (Required) A Project requires at least one expression.
   repeated Expression expressions = 3;
 }
 
 // Relation that applies a boolean expression `condition` on each row of `input` to produce
 // the output result.
 message Filter {
+  // (Required) Input relation for a Filter.
   Relation input = 1;
+
+  // (Required) A Filter must have a condition expression.
   Expression condition = 2;
 }
 
 // Relation of type [[Join]].
 //
 // `left` and `right` must be present.
 message Join {
+  // (Required) Left input relation for a Join.
   Relation left = 1;
+
+  // (Required) Right input relation for a Join.
   Relation right = 2;
+
+  // (Optional) The join condition. Could be unset when `using_columns` is utilized.
+  //
+  // This field does not co-exist with using_columns.
   Expression join_condition = 3;
```

Review Comment: why is this not optional?
[GitHub] [spark] peter-toth commented on pull request #37630: [SPARK-40193][SQL] Merge subquery plans with different filters
peter-toth commented on PR #37630: URL: https://github.com/apache/spark/pull/37630#issuecomment-1318682439

> @peter-toth Could you fix the conflicts again?

Sure, done.
[GitHub] [spark] cloud-fan commented on pull request #38691: [SPARK-41178][SQL] Fix parser rule precedence between JOIN and comma
cloud-fan commented on PR #38691: URL: https://github.com/apache/spark/pull/38691#issuecomment-1318810977

cc @viirya @dongjoon-hyun @gengliangwang @MaxGekk
[GitHub] [spark] cloud-fan closed pull request #38684: [SPARK-41017][SQL][FOLLOWUP] Respect the original Filter operator order
cloud-fan closed pull request #38684: [SPARK-41017][SQL][FOLLOWUP] Respect the original Filter operator order URL: https://github.com/apache/spark/pull/38684