[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-09 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836216634









[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-09 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836192784


   **[Test build #138321 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138321/testReport)**
 for PR 32399 at commit 
[`a1724ab`](https://github.com/apache/spark/commit/a1724ab3c4bb852dcb227bced236fdcbd3f3b93f).





[GitHub] [spark] SparkQA removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue

2021-05-09 Thread GitBox


SparkQA removed a comment on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836190123


   **[Test build #138320 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138320/testReport)**
 for PR 32399 at commit 
[`a6874e5`](https://github.com/apache/spark/commit/a6874e5fc05c1f418500670c59c36bc799977761).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still co

2021-05-09 Thread GitBox


AmplabJenkins removed a comment on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836190600


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138320/
   





[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-09 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836190579


   **[Test build #138320 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138320/testReport)**
 for PR 32399 at commit 
[`a6874e5`](https://github.com/apache/spark/commit/a6874e5fc05c1f418500670c59c36bc799977761).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] AmplabJenkins commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue r

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836190600


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138320/
   





[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-09 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836190123


   **[Test build #138320 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138320/testReport)**
 for PR 32399 at commit 
[`a6874e5`](https://github.com/apache/spark/commit/a6874e5fc05c1f418500670c59c36bc799977761).





[GitHub] [spark] SparkQA removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue

2021-05-09 Thread GitBox


SparkQA removed a comment on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836187589


   **[Test build #138319 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138319/testReport)**
 for PR 32399 at commit 
[`45c64ea`](https://github.com/apache/spark/commit/45c64ead77bfd897b53e383efa67e4ba35c2).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still co

2021-05-09 Thread GitBox


AmplabJenkins removed a comment on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836188026


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138319/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue r

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836188026


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138319/
   





[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-09 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836188006


   **[Test build #138319 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138319/testReport)**
 for PR 32399 at commit 
[`45c64ea`](https://github.com/apache/spark/commit/45c64ead77bfd897b53e383efa67e4ba35c2).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-09 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836187589


   **[Test build #138319 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138319/testReport)**
 for PR 32399 at commit 
[`45c64ea`](https://github.com/apache/spark/commit/45c64ead77bfd897b53e383efa67e4ba35c2).





[GitHub] [spark] SparkQA commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes

2021-05-09 Thread GitBox


SparkQA commented on pull request #32031:
URL: https://github.com/apache/spark/pull/32031#issuecomment-836185216


   **[Test build #138318 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138318/testReport)**
 for PR 32031 at commit 
[`506149f`](https://github.com/apache/spark/commit/506149f3fa92b27bdf09da6748e91516b6dd5aea).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32475: [SPARK-34775][SQL] Push down limit through window when partitionSpec is not empty

2021-05-09 Thread GitBox


AmplabJenkins removed a comment on pull request #32475:
URL: https://github.com/apache/spark/pull/32475#issuecomment-836183250


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42837/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


AmplabJenkins removed a comment on pull request #32487:
URL: https://github.com/apache/spark/pull/32487#issuecomment-836183252


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42836/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-09 Thread GitBox


AmplabJenkins removed a comment on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-836183249


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42838/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-836183249


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42838/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32475: [SPARK-34775][SQL] Push down limit through window when partitionSpec is not empty

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32475:
URL: https://github.com/apache/spark/pull/32475#issuecomment-836183250


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42837/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32487:
URL: https://github.com/apache/spark/pull/32487#issuecomment-836183252


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42836/
   





[GitHub] [spark] SparkQA commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


SparkQA commented on pull request #32487:
URL: https://github.com/apache/spark/pull/32487#issuecomment-836177985









[GitHub] [spark] SparkQA commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-09 Thread GitBox


SparkQA commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-836176561









[GitHub] [spark] SparkQA commented on pull request #32475: [SPARK-34775][SQL] Push down limit through window when partitionSpec is not empty

2021-05-09 Thread GitBox


SparkQA commented on pull request #32475:
URL: https://github.com/apache/spark/pull/32475#issuecomment-836174973


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42837/
   





[GitHub] [spark] SparkQA commented on pull request #32475: [SPARK-34775][SQL] Push down limit through window when partitionSpec is not empty

2021-05-09 Thread GitBox


SparkQA commented on pull request #32475:
URL: https://github.com/apache/spark/pull/32475#issuecomment-836171886


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42837/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-836153120


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138312/
   





[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-09 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836152449


   **[Test build #138317 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138317/testReport)**
 for PR 32399 at commit 
[`e8c86db`](https://github.com/apache/spark/commit/e8c86db1753a097e7ed442fd26d064693e0803e8).





[GitHub] [spark] SparkQA removed a comment on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


SparkQA removed a comment on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-835964738


   **[Test build #138312 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138312/testReport)**
 for PR 32473 at commit 
[`21cc2ac`](https://github.com/apache/spark/commit/21cc2ac907ffe9256942d818663ce225d1a1b992).





[GitHub] [spark] SparkQA commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


SparkQA commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-836151733


   **[Test build #138312 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138312/testReport)**
 for PR 32473 at commit 
[`21cc2ac`](https://github.com/apache/spark/commit/21cc2ac907ffe9256942d818663ce225d1a1b992).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] srowen commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


srowen commented on pull request #32487:
URL: https://github.com/apache/spark/pull/32487#issuecomment-836150808


   Getting pretty big! But OK if needed.





[GitHub] [spark] ulysses-you commented on a change in pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-09 Thread GitBox


ulysses-you commented on a change in pull request #32482:
URL: https://github.com/apache/spark/pull/32482#discussion_r629039347



##
File path: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
##
@@ -1175,7 +1175,7 @@ class CachedTableSuite extends QueryTest with SQLTestUtils
   }
 
   test("cache supports for intervals") {
-withTable("interval_cache") {
+withTable("interval_cache", "t1") {

Review comment:
   Not related to this PR, but it affected the newly added test with `t1`.







[GitHub] [spark] SparkQA commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-09 Thread GitBox


SparkQA commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-836147815


   **[Test build #138316 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138316/testReport)**
 for PR 32482 at commit 
[`7625677`](https://github.com/apache/spark/commit/76256774c52b78b9f6011f82063004bf18734f01).





[GitHub] [spark] ulysses-you commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable

2021-05-09 Thread GitBox


ulysses-you commented on pull request #32482:
URL: https://github.com/apache/spark/pull/32482#issuecomment-836147309


   Thank you @maropu @c21 @dongjoon-hyun.
   
   Agreed, the current config seems like overkill for users; it's better to just make it an `enabled` flag.
   
   Refactored this PR to address:
   * make the new config simpler and improve its doc.
   * improve the tests in two ways: 1) cover more patterns with the AQE test, 2) add a bucketed test.





[GitHub] [spark] SparkQA commented on pull request #32475: [SPARK-34775][SQL] Push down limit through window when partitionSpec is not empty

2021-05-09 Thread GitBox


SparkQA commented on pull request #32475:
URL: https://github.com/apache/spark/pull/32475#issuecomment-83614


   **[Test build #138315 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138315/testReport)**
 for PR 32475 at commit 
[`bf9d041`](https://github.com/apache/spark/commit/bf9d04140d596ba9d4cfe33b0f497a5a9045ba37).





[GitHub] [spark] SparkQA commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


SparkQA commented on pull request #32487:
URL: https://github.com/apache/spark/pull/32487#issuecomment-836145492


   **[Test build #138314 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138314/testReport)**
 for PR 32487 at commit 
[`4098407`](https://github.com/apache/spark/commit/4098407bf6b74f2045ca27c3851da249a2a6ec7e).





[GitHub] [spark] AmplabJenkins commented on pull request #32488: [SPARK-35316][SQL] UnwrapCastInBinaryComparison support In/InSet predicate

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32488:
URL: https://github.com/apache/spark/pull/32488#issuecomment-836144135


   Can one of the admins verify this patch?





[GitHub] [spark] cfmcgrady opened a new pull request #32488: [SPARK-35316][SQL] UnwrapCastInBinaryComparison support In/InSet predicate

2021-05-09 Thread GitBox


cfmcgrady opened a new pull request #32488:
URL: https://github.com/apache/spark/pull/32488


   
   
   ### What changes were proposed in this pull request?
   
   This PR adds `In`/`InSet` predicate support to `UnwrapCastInBinaryComparison`.
   
   The current implementation doesn't push down filters for `In`/`InSet` predicates that contain a `Cast`.
   
   For instance:
   
   ```scala
   spark.range(50).selectExpr("cast(id as int) as id").write.mode("overwrite").parquet("/tmp/parquet/t1")
   spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain
   ```
   
   before this pr:
   
   ```
   == Physical Plan ==
   *(1) Filter cast(id#5 as bigint) IN (1,2,4)
   +- *(1) ColumnarToRow
  +- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct
   ```
   
   after this pr:
   
   ```
   == Physical Plan ==
   *(1) Filter id#95 IN (1,2,4)
   +- *(1) ColumnarToRow
  +- FileScan parquet [id#95] Batched: true, DataFilters: [id#95 IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [In(id, [1,2,4])], ReadSchema: struct
   ```
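   To make the unwrap condition concrete, here is a small self-contained sketch (the helper names below are made up for illustration and are not part of Spark's actual `UnwrapCastInBinaryComparison` rule): the cast can be dropped when every literal on the right-hand side fits the narrower column type, so the literals are narrowed and the filter becomes pushable.
   
   ```scala
   // Hypothetical helpers for illustration only; the real rule works on Catalyst expressions.
   def canUnwrapIntIn(values: Seq[Long]): Boolean =
     values.forall(v => v >= Int.MinValue && v <= Int.MaxValue)
   
   def unwrapIntIn(values: Seq[Long]): Option[Seq[Int]] =
     if (canUnwrapIntIn(values)) Some(values.map(_.toInt)) else None
   
   // `cast(id as bigint) IN (1L, 2L, 4L)` on an INT column can become `id IN (1, 2, 4)`.
   assert(unwrapIntIn(Seq(1L, 2L, 4L)) == Some(Seq(1, 2, 4)))
   
   // A literal outside the Int range can never equal an INT column value after the upcast,
   // so simple narrowing does not apply (the real rule has to treat such values specially).
   assert(unwrapIntIn(Seq(1L, Int.MaxValue.toLong + 1)) == None)
   ```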
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No.
   ### How was this patch tested?
   
   
   New test.
   





[GitHub] [spark] c21 commented on a change in pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


c21 commented on a change in pull request #32476:
URL: https://github.com/apache/spark/pull/32476#discussion_r629027318



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -418,115 +443,140 @@ case class SortMergeJoinExec(
 // Inline mutable state since not many join operations in a task
 val matches = ctx.addMutableState(clsName, "matches",
   v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", forceInline = true)
-// Copy the left keys as class members so they could be used in next function call.
-val matchedKeyVars = copyKeys(ctx, leftKeyVars)
+// Copy the streamed keys as class members so they could be used in next function call.
+val matchedKeyVars = copyKeys(ctx, streamedKeyVars)
+
+// Handle the case when streamed rows has any NULL keys.
+val handleStreamedAnyNull = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"""
+   |$streamedRow = null;
+   |continue;
+ """.stripMargin
+  case LeftOuter | RightOuter =>
+// Eagerly return streamed row.
+s"""
+   |if (!$matches.isEmpty()) {
+   |  $matches.clear();
+   |}
+   |return false;
+ """.stripMargin
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
 
-ctx.addNewFunction("findNextInnerJoinRows",
+// Handle the case when streamed keys less than buffered keys.
+val handleStreamedLessThanBuffered = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"$streamedRow = null;"
+  case LeftOuter | RightOuter =>
+// Eagerly return with streamed row.
+"return false;"
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
+
+ctx.addNewFunction("findNextJoinRows",
   s"""
- |private boolean findNextInnerJoinRows(
- |scala.collection.Iterator leftIter,
- |scala.collection.Iterator rightIter) {
- |  $leftRow = null;
+ |private boolean findNextJoinRows(

Review comment:
   @maropu - No, I think we need the buffer anyway. The buffered rows have the same join keys as the current streamed row, but multiple subsequent streamed rows can share those same join keys. Even though the buffered rows may not satisfy the join condition with the current streamed row, they may satisfy it with the subsequent streamed rows. I think this is how the current sort merge join (both code-gen and iterator) is designed.
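   As a tiny standalone illustration (plain Scala, not the generated join code), two consecutive streamed rows can share a join key, so the buffered rows for that key are needed more than once and cannot be dropped after the first streamed row is processed:
   
   ```scala
   // Both sides sorted by join key; two streamed rows share key 2.
   val streamed = Seq((2, "s1"), (2, "s2"), (3, "s3"))
   val buffered = Seq((2, "b1"), (2, "b2"), (4, "b4"))
   
   // Naive pairing of each streamed row with every buffered row of the same key.
   val joined = for {
     (sk, sv) <- streamed
     (bk, bv) <- buffered if bk == sk
   } yield (sk, sv, bv)
   
   // joined == Seq((2,"s1","b1"), (2,"s1","b2"), (2,"s2","b1"), (2,"s2","b2"))
   ```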







[GitHub] [spark] SparkQA removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue

2021-05-09 Thread GitBox


SparkQA removed a comment on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836035623


   **[Test build #138313 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138313/testReport)**
 for PR 32399 at commit 
[`c6aa4c4`](https://github.com/apache/spark/commit/c6aa4c4ccc8b9103314d5efea148b71e19a560d4).





[GitHub] [spark] viirya commented on a change in pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


viirya commented on a change in pull request #32487:
URL: https://github.com/apache/spark/pull/32487#discussion_r629025675



##
File path: dev/create-release/release-build.sh
##
@@ -210,6 +210,8 @@ if [[ "$1" == "package" ]]; then
 PYSPARK_VERSION=`echo "$SPARK_VERSION" |  sed -e "s/-/./" -e "s/SNAPSHOT/dev0/" -e "s/preview/dev/"`
 echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
 
+export MAVEN_OPTS="-Xmx12000m"

Review comment:
   ok.







[GitHub] [spark] AmplabJenkins removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still co

2021-05-09 Thread GitBox


AmplabJenkins removed a comment on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836109653


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138313/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue r

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836109653


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138313/
   





[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-09 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836108663


   **[Test build #138313 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138313/testReport)**
 for PR 32399 at commit 
[`c6aa4c4`](https://github.com/apache/spark/commit/c6aa4c4ccc8b9103314d5efea148b71e19a560d4).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


AmplabJenkins removed a comment on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-836106608


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138311/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-836106608


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138311/
   





[GitHub] [spark] wangyum commented on a change in pull request #32410: [SPARK-35286][SQL] Replace SessionState.start with SessionState.setCurrentSessionState

2021-05-09 Thread GitBox


wangyum commented on a change in pull request #32410:
URL: https://github.com/apache/spark/pull/32410#discussion_r629020979



##
File path: sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java
##
@@ -141,7 +141,7 @@ public void open(Map<String, String> sessionConfMap) throws HiveSQLException {
 sessionState = new SessionState(hiveConf, username);
 sessionState.setUserIpAddress(ipAddress);
 sessionState.setIsHiveServerQuery(true);
-SessionState.start(sessionState);
+SessionState.setCurrentSessionState(sessionState);

Review comment:
   Yes. It is safe when using `ADD JARS`. We have disabled creating these directories for more than a year with the following change (`HiveConf.ConfVars.WITHSCRATCHDIR=false`):
   
   
![image](https://user-images.githubusercontent.com/5399861/116785447-312cc500-aacc-11eb-8dff-6ae75fbbc4d7.png)
   
   
   







[GitHub] [spark] SparkQA removed a comment on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


SparkQA removed a comment on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-835906957


   **[Test build #138311 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138311/testReport)**
 for PR 32473 at commit 
[`34d0511`](https://github.com/apache/spark/commit/34d05113d307395bd1c1449651e09a8285fd0c6e).





[GitHub] [spark] HeartSaVioR commented on pull request #25911: [SPARK-29223][SQL][SS] Enable global timestamp per topic while specifying offset by timestamp in Kafka source

2021-05-09 Thread GitBox


HeartSaVioR commented on pull request #25911:
URL: https://github.com/apache/spark/pull/25911#issuecomment-836089685


   I see actual customer demand for this: a topic can have 100+ partitions, and it's awkward to make users craft JSON that lists all 100+ partitions just to use the same timestamp.
   
   Flink already does this: it uses a global value across partitions for earliest/latest/timestamp, while still allowing exact per-partition offsets.
   
   https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/datastream/kafka/#kafka-consumers-start-position-configuration
   
   ```
   final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
   
   FlinkKafkaConsumer<String> myConsumer = new FlinkKafkaConsumer<>(...);
   myConsumer.setStartFromEarliest(); // start from the earliest record possible
   myConsumer.setStartFromLatest();   // start from the latest record
   myConsumer.setStartFromTimestamp(...); // start from specified epoch timestamp (milliseconds)
   myConsumer.setStartFromGroupOffsets(); // the default behaviour
   ```
   
   ```
   Map<KafkaTopicPartition, Long> specificStartOffsets = new HashMap<>();
   specificStartOffsets.put(new KafkaTopicPartition("myTopic", 0), 23L);
   specificStartOffsets.put(new KafkaTopicPartition("myTopic", 1), 31L);
   specificStartOffsets.put(new KafkaTopicPartition("myTopic", 2), 43L);
   
   myConsumer.setStartFromSpecificOffsets(specificStartOffsets);
   ```
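   For comparison, on the Spark side this is roughly the difference (a minimal sketch assuming an active `SparkSession` named `spark`; `startingOffsetsByTimestamp` is the existing per-partition option, while the global `startingTimestamp` option below only illustrates what this PR proposes and is not a settled name):
   
   ```scala
   // Today: the JSON has to enumerate every partition, even for one shared timestamp.
   val perPartition = spark.readStream
     .format("kafka")
     .option("kafka.bootstrap.servers", "host1:9092")
     .option("subscribe", "myTopic")
     .option("startingOffsetsByTimestamp",
       """{"myTopic": {"0": 1620000000000, "1": 1620000000000, "2": 1620000000000}}""")
     .load()
   
   // Proposed: one timestamp applied to all matched partitions of the topic.
   val global = spark.readStream
     .format("kafka")
     .option("kafka.bootstrap.servers", "host1:9092")
     .option("subscribe", "myTopic")
     .option("startingTimestamp", "1620000000000")  // illustrative option name
     .load()
   ```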
   
   Given this PR is stale, I'll rebase this with master and raise the PR again.





[GitHub] [spark] SparkQA commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


SparkQA commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-836088555


   **[Test build #138311 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138311/testReport)**
 for PR 32473 at commit 
[`34d0511`](https://github.com/apache/spark/commit/34d05113d307395bd1c1449651e09a8285fd0c6e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] c21 commented on pull request #32480: [SPARK-35354][SQL] Replace BaseJoinExec with ShuffledJoin in CoalesceBucketsInJoin

2021-05-09 Thread GitBox


c21 commented on pull request #32480:
URL: https://github.com/apache/spark/pull/32480#issuecomment-836086921


   Thank you @maropu for review!





[GitHub] [spark] beliefer commented on a change in pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-09 Thread GitBox


beliefer commented on a change in pull request #32442:
URL: https://github.com/apache/spark/pull/32442#discussion_r629016341



##
File path: sql/core/src/test/resources/sql-tests/inputs/cte-ddl.sql
##
@@ -0,0 +1,65 @@
+-- Test data.
+CREATE NAMESPACE IF NOT EXISTS query_ddl_namespace;
+USE NAMESPACE query_ddl_namespace;
+CREATE TABLE test_show_tables(a INT, b STRING, c INT) using parquet;
+CREATE TABLE test_show_table_properties (a INT, b STRING, c INT) USING parquet TBLPROPERTIES('p1'='v1', 'p2'='v2');
+CREATE TABLE test_show_partitions(a String, b Int, c String, d String) USING parquet PARTITIONED BY (c, d);
+ALTER TABLE test_show_partitions ADD PARTITION (c='Us', d=1);
+ALTER TABLE test_show_partitions ADD PARTITION (c='Us', d=2);
+ALTER TABLE test_show_partitions ADD PARTITION (c='Cn', d=1);
+CREATE VIEW view_1 AS SELECT * FROM test_show_tables;
+CREATE VIEW view_2 AS SELECT * FROM test_show_tables WHERE c=1;
+CREATE TEMPORARY VIEW test_show_views(e int) USING parquet;
+CREATE GLOBAL TEMP VIEW test_global_show_views AS SELECT 1 as col1;
+
+-- SHOW NAMESPACES
+SHOW NAMESPACES;
+WITH s AS (SHOW NAMESPACES) SELECT * FROM s;
+WITH s AS (SHOW NAMESPACES) SELECT * FROM s WHERE namespace = 'query_ddl_namespace';
+WITH s(n) AS (SHOW NAMESPACES) SELECT * FROM s WHERE n = 'query_ddl_namespace';
+
+-- SHOW TABLES
+SHOW TABLES;
+WITH s AS (SHOW TABLES) SELECT * FROM s;
+WITH s AS (SHOW TABLES) SELECT * FROM s WHERE tableName = 'test_show_tables';
+WITH s(ns, tn, t) AS (SHOW TABLES) SELECT * FROM s WHERE tn = 'test_show_tables';

Review comment:
   OK







[GitHub] [spark] LuciferYang closed pull request #32374: [WIP][SPARK-35253][BUILD][SQL] Upgrade Janino from 3.0.16 to 3.1.3

2021-05-09 Thread GitBox


LuciferYang closed pull request #32374:
URL: https://github.com/apache/spark/pull/32374


   





[GitHub] [spark] LuciferYang commented on pull request #32374: [WIP][SPARK-35253][BUILD][SQL] Upgrade Janino from 3.0.16 to 3.1.3

2021-05-09 Thread GitBox


LuciferYang commented on pull request #32374:
URL: https://github.com/apache/spark/pull/32374#issuecomment-836082137


   Closing this because of SPARK-35253.





[GitHub] [spark] LuciferYang commented on a change in pull request #32455: [SPARK-35253][SQL][BUILD] Bump up the janino version to v3.1.4

2021-05-09 Thread GitBox


LuciferYang commented on a change in pull request #32455:
URL: https://github.com/apache/spark/pull/32455#discussion_r629014929



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
##
@@ -1434,9 +1435,10 @@ object CodeGenerator extends Logging {
   private def updateAndGetCompilationStats(evaluator: ClassBodyEvaluator): ByteCodeStats = {
 // First retrieve the generated classes.
 val classes = {
-  val resultField = classOf[SimpleCompiler].getDeclaredField("result")
-  resultField.setAccessible(true)
-  val loader = resultField.get(evaluator).asInstanceOf[ByteArrayClassLoader]
+  val scField = classOf[ClassBodyEvaluator].getDeclaredField("sc")

Review comment:
   @maropu Can we directly use `evaluator.getBytecodes.asScala` instead of lines 1438 ~ 1445?
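   A rough sketch of that simplification (assuming, as the comment suggests, that Janino 3.1.x exposes `getBytecodes` returning a `java.util.Map` of class name to bytecode; this is an illustration, not the final change):
   
   ```scala
   import scala.collection.JavaConverters._
   import org.codehaus.janino.ClassBodyEvaluator
   
   // Hypothetical helper: collect (class name, bytecode) pairs directly from the evaluator,
   // instead of reflecting on SimpleCompiler's private "result"/"sc" fields.
   def generatedClassBytes(evaluator: ClassBodyEvaluator): Seq[(String, Array[Byte])] =
     evaluator.getBytecodes.asScala.toSeq  // assumes getBytecodes: java.util.Map[String, Array[Byte]]
   ```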







[GitHub] [spark] AmplabJenkins removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still co

2021-05-09 Thread GitBox


AmplabJenkins removed a comment on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836069987


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42835/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue r

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836069987


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42835/
   





[GitHub] [spark] zhengruifeng commented on pull request #32350: [SPARK-35231][SQL] logical.Range override maxRowsPerPartition

2021-05-09 Thread GitBox


zhengruifeng commented on pull request #32350:
URL: https://github.com/apache/spark/pull/32350#issuecomment-836067509


   Thank you so much! 





[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-09 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836058502


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42835/
   





[GitHub] [spark] maropu commented on a change in pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


maropu commented on a change in pull request #32487:
URL: https://github.com/apache/spark/pull/32487#discussion_r629004607



##
File path: dev/create-release/release-build.sh
##
@@ -210,6 +210,8 @@ if [[ "$1" == "package" ]]; then
 PYSPARK_VERSION=`echo "$SPARK_VERSION" |  sed -e "s/-/./" -e 
"s/SNAPSHOT/dev0/" -e "s/preview/dev/"`
 echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
 
+export MAVEN_OPTS="-Xmx12000m"

Review comment:
   nit: we can say `-Xmx12g`?







[GitHub] [spark] huaxingao commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


huaxingao commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-836051980


   @dongjoon-hyun 
   
   > Shall we change the grouping in order to see the trend according to the block size?
   
   Sorry, I just saw your comment. I guess it might be a little better to pair up the results of `Without bloom filter` and `With bloom filter`, so it's easier to see the improvement from the bloom filter?
   





[GitHub] [spark] huaxingao commented on a change in pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


huaxingao commented on a change in pull request #32473:
URL: https://github.com/apache/spark/pull/32473#discussion_r629004056



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BloomFilterBenchmark.scala
##
@@ -81,8 +80,57 @@ object BloomFilterBenchmark extends SqlBasedBenchmark {
 }
   }
 
+  private def writeParquetBenchmark(): Unit = {
+withTempPath { dir =>
+  val path = dir.getCanonicalPath
+
+  runBenchmark(s"Parquet Write") {
+val benchmark = new Benchmark(s"Write ${scaleFactor}M rows", N, output 
= output)
+benchmark.addCase("Without bloom filter") { _ =>
+  df.write.mode("overwrite").parquet(path + "/withoutBF")
+}
+benchmark.addCase("With bloom filter") { _ =>
+  df.write.mode("overwrite")
+.option(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#value", true)
+.parquet(path + "/withBF")
+}
+benchmark.run()
+  }
+}
+  }
+
+  private def readParquetBenchmark(): Unit = {
+val blockSizes = Seq(512 * 1024, 1024 * 1024, 2 * 1024 * 1024, 3 * 1024 * 
1024,
+  4 * 1024 * 1024, 5 * 1024 * 1024, 6 * 1024 * 1024, 7 * 1024 * 1024,
+  8 * 1024 * 1024, 9 * 1024 * 1024, 10 * 1024 * 1024)
+for (blocksize <- blockSizes) {
+  withTempPath { dir =>
+val path = dir.getCanonicalPath
+
+df.write.option("parquet.block.size", blocksize).parquet(path + 
"/withoutBF")

Review comment:
   @wangyum Sorry, I am new to Parquet. Somehow I didn't see that Parquet has a compression size option; it seems only ORC has `orc.compress.size`?







[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or

2021-05-09 Thread GitBox


SparkQA commented on pull request #32399:
URL: https://github.com/apache/spark/pull/32399#issuecomment-836035623


   **[Test build #138313 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138313/testReport)**
 for PR 32399 at commit 
[`c6aa4c4`](https://github.com/apache/spark/commit/c6aa4c4ccc8b9103314d5efea148b71e19a560d4).





[GitHub] [spark] AmplabJenkins commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32487:
URL: https://github.com/apache/spark/pull/32487#issuecomment-836035119


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138310/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


AmplabJenkins removed a comment on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-836035114









[GitHub] [spark] AmplabJenkins removed a comment on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


AmplabJenkins removed a comment on pull request #32487:
URL: https://github.com/apache/spark/pull/32487#issuecomment-836035119


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138310/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-836035114









[GitHub] [spark] maropu commented on pull request #32480: [SPARK-35354][SQL] Replace BaseJoinExec with ShuffledJoin in CoalesceBucketsInJoin

2021-05-09 Thread GitBox


maropu commented on pull request #32480:
URL: https://github.com/apache/spark/pull/32480#issuecomment-836019661


   Thank you, @c21. Merged to master.





[GitHub] [spark] maropu closed pull request #32480: [SPARK-35354][SQL] Replace BaseJoinExec with ShuffledJoin in CoalesceBucketsInJoin

2021-05-09 Thread GitBox


maropu closed pull request #32480:
URL: https://github.com/apache/spark/pull/32480


   





[GitHub] [spark] SparkQA commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


SparkQA commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-835996955









[GitHub] [spark] wangyum commented on a change in pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-09 Thread GitBox


wangyum commented on a change in pull request #32442:
URL: https://github.com/apache/spark/pull/32442#discussion_r628987144



##
File path: sql/core/src/test/resources/sql-tests/inputs/cte-ddl.sql
##
@@ -0,0 +1,65 @@
+-- Test data.
+CREATE NAMESPACE IF NOT EXISTS query_ddl_namespace;
+USE NAMESPACE query_ddl_namespace;
+CREATE TABLE test_show_tables(a INT, b STRING, c INT) using parquet;
+CREATE TABLE test_show_table_properties (a INT, b STRING, c INT) USING parquet 
TBLPROPERTIES('p1'='v1', 'p2'='v2');
+CREATE TABLE test_show_partitions(a String, b Int, c String, d String) USING 
parquet PARTITIONED BY (c, d);
+ALTER TABLE test_show_partitions ADD PARTITION (c='Us', d=1);
+ALTER TABLE test_show_partitions ADD PARTITION (c='Us', d=2);
+ALTER TABLE test_show_partitions ADD PARTITION (c='Cn', d=1);
+CREATE VIEW view_1 AS SELECT * FROM test_show_tables;
+CREATE VIEW view_2 AS SELECT * FROM test_show_tables WHERE c=1;
+CREATE TEMPORARY VIEW test_show_views(e int) USING parquet;
+CREATE GLOBAL TEMP VIEW test_global_show_views AS SELECT 1 as col1;
+
+-- SHOW NAMESPACES
+SHOW NAMESPACES;
+WITH s AS (SHOW NAMESPACES) SELECT * FROM s;
+WITH s AS (SHOW NAMESPACES) SELECT * FROM s WHERE namespace = 
'query_ddl_namespace';
+WITH s(n) AS (SHOW NAMESPACES) SELECT * FROM s WHERE n = 'query_ddl_namespace';
+
+-- SHOW TABLES
+SHOW TABLES;
+WITH s AS (SHOW TABLES) SELECT * FROM s;
+WITH s AS (SHOW TABLES) SELECT * FROM s WHERE tableName = 'test_show_tables';
+WITH s(ns, tn, t) AS (SHOW TABLES) SELECT * FROM s WHERE tn = 
'test_show_tables';

Review comment:
   Could we add more tests? For example:
   ```sql
   WITH s(ns, tn, t) AS (SHOW TABLES) SELECT tn FROM s;
   WITH s(ns, tn, t) AS (SHOW TABLES) SELECT tn FROM s ORDER BY tn;
   ```







[GitHub] [spark] maropu commented on a change in pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


maropu commented on a change in pull request #32476:
URL: https://github.com/apache/spark/pull/32476#discussion_r628986405



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -418,115 +443,140 @@ case class SortMergeJoinExec(
 // Inline mutable state since not many join operations in a task
 val matches = ctx.addMutableState(clsName, "matches",
   v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", 
forceInline = true)
-// Copy the left keys as class members so they could be used in next 
function call.
-val matchedKeyVars = copyKeys(ctx, leftKeyVars)
+// Copy the streamed keys as class members so they could be used in next 
function call.
+val matchedKeyVars = copyKeys(ctx, streamedKeyVars)
+
+// Handle the case when streamed rows has any NULL keys.
+val handleStreamedAnyNull = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"""
+   |$streamedRow = null;
+   |continue;
+ """.stripMargin
+  case LeftOuter | RightOuter =>
+// Eagerly return streamed row.
+s"""
+   |if (!$matches.isEmpty()) {
+   |  $matches.clear();
+   |}
+   |return false;
+ """.stripMargin
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
 
-ctx.addNewFunction("findNextInnerJoinRows",
+// Handle the case when streamed keys less than buffered keys.
+val handleStreamedLessThanBuffered = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"$streamedRow = null;"
+  case LeftOuter | RightOuter =>
+// Eagerly return with streamed row.
+"return false;"
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
+
+ctx.addNewFunction("findNextJoinRows",
   s"""
- |private boolean findNextInnerJoinRows(
- |scala.collection.Iterator leftIter,
- |scala.collection.Iterator rightIter) {
- |  $leftRow = null;
+ |private boolean findNextJoinRows(

Review comment:
   BTW, in the current generated code, it seems `conditionCheck` is evaluated outside `findNextJoinRows`. Can't we evaluate it inside `findNextJoinRows` to avoid putting unmatched rows in `matches`?







[GitHub] [spark] maropu commented on a change in pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


maropu commented on a change in pull request #32476:
URL: https://github.com/apache/spark/pull/32476#discussion_r628986405



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -418,115 +443,140 @@ case class SortMergeJoinExec(
 // Inline mutable state since not many join operations in a task
 val matches = ctx.addMutableState(clsName, "matches",
   v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", 
forceInline = true)
-// Copy the left keys as class members so they could be used in next 
function call.
-val matchedKeyVars = copyKeys(ctx, leftKeyVars)
+// Copy the streamed keys as class members so they could be used in next 
function call.
+val matchedKeyVars = copyKeys(ctx, streamedKeyVars)
+
+// Handle the case when streamed rows has any NULL keys.
+val handleStreamedAnyNull = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"""
+   |$streamedRow = null;
+   |continue;
+ """.stripMargin
+  case LeftOuter | RightOuter =>
+// Eagerly return streamed row.
+s"""
+   |if (!$matches.isEmpty()) {
+   |  $matches.clear();
+   |}
+   |return false;
+ """.stripMargin
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
 
-ctx.addNewFunction("findNextInnerJoinRows",
+// Handle the case when streamed keys less than buffered keys.
+val handleStreamedLessThanBuffered = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"$streamedRow = null;"
+  case LeftOuter | RightOuter =>
+// Eagerly return with streamed row.
+"return false;"
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
+
+ctx.addNewFunction("findNextJoinRows",
   s"""
- |private boolean findNextInnerJoinRows(
- |scala.collection.Iterator leftIter,
- |scala.collection.Iterator rightIter) {
- |  $leftRow = null;
+ |private boolean findNextJoinRows(

Review comment:
   BTW, in the current generated code, it seems `conditionCheck` is evaluated outside `findNextJoinRows`. Can't we evaluate it inside `findNextJoinRows` to avoid actually putting unmatched rows in `matches`?







[GitHub] [spark] wangyum commented on a change in pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-09 Thread GitBox


wangyum commented on a change in pull request #32442:
URL: https://github.com/apache/spark/pull/32442#discussion_r628980328



##
File path: 
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
##
@@ -375,8 +363,18 @@ ctes
 : WITH namedQuery (',' namedQuery)*
 ;
 
+informationQuery
+: SHOW (DATABASES | NAMESPACES) ((FROM | IN) multipartIdentifier)? (LIKE? 
pattern=STRING)?  #showNamespaces
+| SHOW TABLES ((FROM | IN) multipartIdentifier)? (LIKE? pattern=STRING)?   
 #showTables

Review comment:
   Why not support `SHOW TABLE EXTENDED`?







[GitHub] [spark] c21 commented on a change in pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


c21 commented on a change in pull request #32476:
URL: https://github.com/apache/spark/pull/32476#discussion_r628977186



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -418,115 +443,140 @@ case class SortMergeJoinExec(
 // Inline mutable state since not many join operations in a task
 val matches = ctx.addMutableState(clsName, "matches",
   v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", 
forceInline = true)
-// Copy the left keys as class members so they could be used in next 
function call.
-val matchedKeyVars = copyKeys(ctx, leftKeyVars)
+// Copy the streamed keys as class members so they could be used in next 
function call.
+val matchedKeyVars = copyKeys(ctx, streamedKeyVars)
+
+// Handle the case when streamed rows has any NULL keys.
+val handleStreamedAnyNull = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"""
+   |$streamedRow = null;
+   |continue;
+ """.stripMargin
+  case LeftOuter | RightOuter =>
+// Eagerly return streamed row.
+s"""
+   |if (!$matches.isEmpty()) {
+   |  $matches.clear();
+   |}
+   |return false;
+ """.stripMargin
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
 
-ctx.addNewFunction("findNextInnerJoinRows",
+// Handle the case when streamed keys less than buffered keys.
+val handleStreamedLessThanBuffered = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"$streamedRow = null;"
+  case LeftOuter | RightOuter =>
+// Eagerly return with streamed row.
+"return false;"
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
+
+ctx.addNewFunction("findNextJoinRows",
   s"""
- |private boolean findNextInnerJoinRows(
- |scala.collection.Iterator leftIter,
- |scala.collection.Iterator rightIter) {
- |  $leftRow = null;
+ |private boolean findNextJoinRows(
+ |scala.collection.Iterator streamedIter,
+ |scala.collection.Iterator bufferedIter) {
+ |  $streamedRow = null;
  |  int comp = 0;
- |  while ($leftRow == null) {
- |if (!leftIter.hasNext()) return false;
- |$leftRow = (InternalRow) leftIter.next();
- |${leftKeyVars.map(_.code).mkString("\n")}
- |if ($leftAnyNull) {
- |  $leftRow = null;
- |  continue;
+ |  while ($streamedRow == null) {
+ |if (!streamedIter.hasNext()) return false;
+ |$streamedRow = (InternalRow) streamedIter.next();
+ |${streamedKeyVars.map(_.code).mkString("\n")}
+ |if ($streamedAnyNull) {
+ |  $handleStreamedAnyNull
  |}
  |if (!$matches.isEmpty()) {
- |  ${genComparison(ctx, leftKeyVars, matchedKeyVars)}
+ |  ${genComparison(ctx, streamedKeyVars, matchedKeyVars)}
  |  if (comp == 0) {
  |return true;
  |  }
  |  $matches.clear();
  |}
  |
  |do {
- |  if ($rightRow == null) {
- |if (!rightIter.hasNext()) {
+ |  if ($bufferedRow == null) {
+ |if (!bufferedIter.hasNext()) {
  |  ${matchedKeyVars.map(_.code).mkString("\n")}
  |  return !$matches.isEmpty();
  |}
- |$rightRow = (InternalRow) rightIter.next();
- |${rightKeyTmpVars.map(_.code).mkString("\n")}
- |if ($rightAnyNull) {
- |  $rightRow = null;
+ |$bufferedRow = (InternalRow) bufferedIter.next();
+ |${bufferedKeyTmpVars.map(_.code).mkString("\n")}
+ |if ($bufferedAnyNull) {
+ |  $bufferedRow = null;
  |  continue;
  |}
- |${rightKeyVars.map(_.code).mkString("\n")}
+ |${bufferedKeyVars.map(_.code).mkString("\n")}
  |  }
- |  ${genComparison(ctx, leftKeyVars, rightKeyVars)}
+ |  ${genComparison(ctx, streamedKeyVars, bufferedKeyVars)}
  |  if (comp > 0) {
- |$rightRow = null;
+ |$bufferedRow = null;
  |  } else if (comp < 0) {
  |if (!$matches.isEmpty()) {
  |  ${matchedKeyVars.map(_.code).mkString("\n")}
  |  return true;
+ |} else {
+ |  $handleStreamedLessThanBuffered
  |}
- |$leftRow = null;
  |  } else {
- |

[GitHub] [spark] c21 commented on a change in pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


c21 commented on a change in pull request #32476:
URL: https://github.com/apache/spark/pull/32476#discussion_r628976694



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -554,67 +604,118 @@ case class SortMergeJoinExec(
 
   override def doProduce(ctx: CodegenContext): String = {
 // Inline mutable state since not many join operations in a task
-val leftInput = ctx.addMutableState("scala.collection.Iterator", 
"leftInput",
+val streamedInput = ctx.addMutableState("scala.collection.Iterator", 
"streamedInput",
   v => s"$v = inputs[0];", forceInline = true)
-val rightInput = ctx.addMutableState("scala.collection.Iterator", 
"rightInput",
+val bufferedInput = ctx.addMutableState("scala.collection.Iterator", 
"bufferedInput",
   v => s"$v = inputs[1];", forceInline = true)
 
-val (leftRow, matches) = genScanner(ctx)
+val (streamedRow, matches) = genScanner(ctx)
 
 // Create variables for row from both sides.
-val (leftVars, leftVarDecl) = createLeftVars(ctx, leftRow)
-val rightRow = ctx.freshName("rightRow")
-val rightVars = createRightVar(ctx, rightRow)
+val (streamedVars, streamedVarDecl) = createStreamedVars(ctx, streamedRow)
+val bufferedRow = ctx.freshName("bufferedRow")
+val bufferedVars = genBuildSideVars(ctx, bufferedRow, bufferedPlan)
 
 val iterator = ctx.freshName("iterator")
 val numOutput = metricTerm(ctx, "numOutputRows")
-val (beforeLoop, condCheck) = if (condition.isDefined) {
+val resultVars = joinType match {
+  case _: InnerLike | LeftOuter =>
+streamedVars ++ bufferedVars
+  case RightOuter =>
+bufferedVars ++ streamedVars
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.doProduce should not take $x as the JoinType")
+}
+
+val (beforeLoop, conditionCheck) = if (condition.isDefined) {
   // Split the code of creating variables based on whether it's used by 
condition or not.
   val loaded = ctx.freshName("loaded")
-  val (leftBefore, leftAfter) = splitVarsByCondition(left.output, leftVars)
-  val (rightBefore, rightAfter) = splitVarsByCondition(right.output, 
rightVars)
+  val (streamedBefore, streamedAfter) = 
splitVarsByCondition(streamedOutput, streamedVars)
+  val (bufferedBefore, bufferedAfter) = 
splitVarsByCondition(bufferedOutput, bufferedVars)
   // Generate code for condition
-  ctx.currentVars = leftVars ++ rightVars
+  ctx.currentVars = resultVars
   val cond = BindReferences.bindReference(condition.get, 
output).genCode(ctx)
   // evaluate the columns those used by condition before loop
-  val before = s"""
+  val before =
+s"""
|boolean $loaded = false;
-   |$leftBefore
+   |$streamedBefore
  """.stripMargin
 
-  val checking = s"""
- |$rightBefore
- |${cond.code}
- |if (${cond.isNull} || !${cond.value}) continue;
- |if (!$loaded) {
- |  $loaded = true;
- |  $leftAfter
- |}
- |$rightAfter
- """.stripMargin
+  val checking =
+s"""
+   |$bufferedBefore
+   |if ($bufferedRow != null) {
+   |  ${cond.code}
+   |  if (${cond.isNull} || !${cond.value}) {
+   |continue;
+   |  }
+   |}
+   |if (!$loaded) {
+   |  $loaded = true;
+   |  $streamedAfter
+   |}
+   |$bufferedAfter
+ """.stripMargin
   (before, checking)
 } else {
-  (evaluateVariables(leftVars), "")
+  (evaluateVariables(streamedVars), "")
 }
 
 val thisPlan = ctx.addReferenceObj("plan", this)
 val eagerCleanup = s"$thisPlan.cleanupResources();"
 
-s"""
-   |while (findNextInnerJoinRows($leftInput, $rightInput)) {
-   |  ${leftVarDecl.mkString("\n")}
-   |  ${beforeLoop.trim}
-   |  scala.collection.Iterator $iterator = 
$matches.generateIterator();
-   |  while ($iterator.hasNext()) {
-   |InternalRow $rightRow = (InternalRow) $iterator.next();
-   |${condCheck.trim}
-   |$numOutput.add(1);
-   |${consume(ctx, leftVars ++ rightVars)}
-   |  }
-   |  if (shouldStop()) return;
-   |}
-   |$eagerCleanup
+lazy val innerJoin =
+  s"""
+ |while (findNextJoinRows($streamedInput, $bufferedInput)) {
+ |  ${streamedVarDecl.mkString("\n")}
+ |  ${beforeLoop.trim}
+ |  scala.collection.Iterator $iterator = 
$matches.generateIterator();
+ |  while ($iterator.hasNext()) {
+ |InternalRow $bufferedRow = (InternalRow) $iterator.next();
+ |${conditionCheck.trim}
+ |$numOutput.add(1);
+ |${consume(ctx, resultVars)}
+ |  }
+ |  if (shouldStop()) return;
+ |}
+ |$eagerCleanup
  """.stripMargin
+
+lazy 

[GitHub] [spark] wangyum commented on a change in pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-09 Thread GitBox


wangyum commented on a change in pull request #32442:
URL: https://github.com/apache/spark/pull/32442#discussion_r628976181



##
File path: sql/core/src/test/resources/sql-tests/inputs/cte-ddl.sql
##
@@ -0,0 +1,65 @@
+-- Test data.
+CREATE NAMESPACE IF NOT EXISTS query_ddl_namespace;
+USE NAMESPACE query_ddl_namespace;
+CREATE TABLE test_show_tables(a INT, b STRING, c INT) using parquet;
+CREATE TABLE test_show_table_properties (a INT, b STRING, c INT) USING parquet 
TBLPROPERTIES('p1'='v1', 'p2'='v2');
+CREATE TABLE test_show_partitions(a String, b Int, c String, d String) USING 
parquet PARTITIONED BY (c, d);
+ALTER TABLE test_show_partitions ADD PARTITION (c='Us', d=1);
+ALTER TABLE test_show_partitions ADD PARTITION (c='Us', d=2);
+ALTER TABLE test_show_partitions ADD PARTITION (c='Cn', d=1);
+CREATE VIEW view_1 AS SELECT * FROM test_show_tables;
+CREATE VIEW view_2 AS SELECT * FROM test_show_tables WHERE c=1;
+CREATE TEMPORARY VIEW test_show_views(e int) USING parquet;
+CREATE GLOBAL TEMP VIEW test_global_show_views AS SELECT 1 as col1;
+
+-- SHOW NAMESPACES
+SHOW NAMESPACES;
+WITH s AS (SHOW NAMESPACES) SELECT * FROM s;
+WITH s AS (SHOW NAMESPACES) SELECT * FROM s WHERE namespace = 
'query_ddl_namespace';
+WITH s(n) AS (SHOW NAMESPACES) SELECT * FROM s WHERE n = 'query_ddl_namespace';
+
+-- SHOW TABLES
+SHOW TABLES;
+WITH s AS (SHOW TABLES) SELECT * FROM s;
+WITH s AS (SHOW TABLES) SELECT * FROM s WHERE tableName = 'test_show_tables';
+WITH s(ns, tn, t) AS (SHOW TABLES) SELECT * FROM s WHERE tn = 
'test_show_tables';
+
+-- SHOW TBLPROPERTIES
+SHOW TBLPROPERTIES test_show_table_properties;
+WITH s AS (SHOW TBLPROPERTIES test_show_table_properties) SELECT * FROM s;
+WITH s AS (SHOW TBLPROPERTIES test_show_table_properties) SELECT * FROM s 
WHERE key = 'p1';
+WITH s(k, v) AS (SHOW TBLPROPERTIES test_show_table_properties) SELECT * FROM 
s WHERE k = 'p1';
+
+-- SHOW PARTITIONS
+SHOW PARTITIONS test_show_partitions;
+WITH s AS (SHOW PARTITIONS test_show_partitions) SELECT * FROM s;
+WITH s AS (SHOW PARTITIONS test_show_partitions) SELECT * FROM s WHERE 
partition = 'c=Us/d=1';
+WITH s(p) AS (SHOW PARTITIONS test_show_partitions) SELECT * FROM s WHERE p = 
'c=Us/d=1';
+
+-- SHOW COLUMNS
+SHOW COLUMNS in test_show_tables;
+WITH s AS (SHOW COLUMNS in test_show_tables) SELECT * FROM s;
+WITH s AS (SHOW COLUMNS in test_show_tables) SELECT * FROM s WHERE col_name = 
'a';
+WITH s(c) AS (SHOW COLUMNS in test_show_tables) SELECT * FROM s WHERE c = 'a';
+
+-- SHOW VIEWS
+SHOW VIEWS;
+WITH s AS (SHOW VIEWS) SELECT * FROM s;
+WITH s AS (SHOW VIEWS) SELECT * FROM s WHERE viewName = 'test_show_views';
+WITH s(ns, vn, t) AS (SHOW VIEWS) SELECT * FROM s WHERE vn = 'test_show_views';
+
+-- SHOW FUNCTIONS
+WITH s AS (SHOW FUNCTIONS) SELECT * FROM s LIMIT 3;
+WITH s AS (SHOW FUNCTIONS) SELECT * FROM s WHERE function LIKE 'an%';
+WITH s(f) AS (SHOW FUNCTIONS) SELECT * FROM s WHERE f LIKE 'an%';
+
+-- Clean Up
+DROP VIEW global_temp.test_global_show_views;
+DROP VIEW test_show_views;
+DROP VIEW view_2;
+DROP VIEW view_1;
+DROP TABLE test_show_partitions;
+DROP TABLE test_show_table_properties;
+DROP TABLE test_show_tables;
+USE default;
+DROP NAMESPACE query_ddl_namespace;

Review comment:
   Please add a newline character.







[GitHub] [spark] SparkQA removed a comment on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


SparkQA removed a comment on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-835879367


   **[Test build #138309 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138309/testReport)**
 for PR 32473 at commit 
[`10d7a97`](https://github.com/apache/spark/commit/10d7a977391d659d2060ba596c55d0334754866c).





[GitHub] [spark] maropu commented on a change in pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


maropu commented on a change in pull request #32476:
URL: https://github.com/apache/spark/pull/32476#discussion_r628974305



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -418,115 +443,140 @@ case class SortMergeJoinExec(
 // Inline mutable state since not many join operations in a task
 val matches = ctx.addMutableState(clsName, "matches",
   v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", 
forceInline = true)
-// Copy the left keys as class members so they could be used in next 
function call.
-val matchedKeyVars = copyKeys(ctx, leftKeyVars)
+// Copy the streamed keys as class members so they could be used in next 
function call.
+val matchedKeyVars = copyKeys(ctx, streamedKeyVars)
+
+// Handle the case when streamed rows has any NULL keys.
+val handleStreamedAnyNull = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"""
+   |$streamedRow = null;
+   |continue;
+ """.stripMargin
+  case LeftOuter | RightOuter =>
+// Eagerly return streamed row.
+s"""
+   |if (!$matches.isEmpty()) {
+   |  $matches.clear();
+   |}
+   |return false;
+ """.stripMargin
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
 
-ctx.addNewFunction("findNextInnerJoinRows",
+// Handle the case when streamed keys less than buffered keys.
+val handleStreamedLessThanBuffered = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"$streamedRow = null;"
+  case LeftOuter | RightOuter =>
+// Eagerly return with streamed row.
+"return false;"
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
+
+ctx.addNewFunction("findNextJoinRows",
   s"""
- |private boolean findNextInnerJoinRows(
- |scala.collection.Iterator leftIter,
- |scala.collection.Iterator rightIter) {
- |  $leftRow = null;
+ |private boolean findNextJoinRows(

Review comment:
   > In the outer case, a return value is not used?
   
   > Yes. Otherwise it's very hard to re-use code in `findNextJoinRows`. I can make further changes so that `findNextJoinRows` does not return anything in the outer-join case. Do we want to do that?
   
   Okay, the current one looks fine. Let's just wait for a comment from @cloud-fan here.







[GitHub] [spark] SparkQA commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


SparkQA commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-835985975


   **[Test build #138309 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138309/testReport)**
 for PR 32473 at commit 
[`10d7a97`](https://github.com/apache/spark/commit/10d7a977391d659d2060ba596c55d0334754866c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] maropu commented on a change in pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


maropu commented on a change in pull request #32476:
URL: https://github.com/apache/spark/pull/32476#discussion_r628972459



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -418,115 +443,140 @@ case class SortMergeJoinExec(
 // Inline mutable state since not many join operations in a task
 val matches = ctx.addMutableState(clsName, "matches",
   v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", 
forceInline = true)
-// Copy the left keys as class members so they could be used in next 
function call.
-val matchedKeyVars = copyKeys(ctx, leftKeyVars)
+// Copy the streamed keys as class members so they could be used in next 
function call.
+val matchedKeyVars = copyKeys(ctx, streamedKeyVars)
+
+// Handle the case when streamed rows has any NULL keys.
+val handleStreamedAnyNull = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"""
+   |$streamedRow = null;
+   |continue;
+ """.stripMargin
+  case LeftOuter | RightOuter =>
+// Eagerly return streamed row.
+s"""
+   |if (!$matches.isEmpty()) {
+   |  $matches.clear();
+   |}
+   |return false;
+ """.stripMargin
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
 
-ctx.addNewFunction("findNextInnerJoinRows",
+// Handle the case when streamed keys less than buffered keys.
+val handleStreamedLessThanBuffered = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"$streamedRow = null;"
+  case LeftOuter | RightOuter =>
+// Eagerly return with streamed row.
+"return false;"
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
+
+ctx.addNewFunction("findNextJoinRows",
   s"""
- |private boolean findNextInnerJoinRows(
- |scala.collection.Iterator leftIter,
- |scala.collection.Iterator rightIter) {
- |  $leftRow = null;
+ |private boolean findNextJoinRows(

Review comment:
   > Why don't we need to put all the rows? We need to evaluate all the rows on the buffered side for the join anyway, right?
   
   Oh, my bad. Yeah, you're right. I misunderstood it.







[GitHub] [spark] maropu commented on a change in pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


maropu commented on a change in pull request #32476:
URL: https://github.com/apache/spark/pull/32476#discussion_r628969762



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -418,115 +443,140 @@ case class SortMergeJoinExec(
 // Inline mutable state since not many join operations in a task
 val matches = ctx.addMutableState(clsName, "matches",
   v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", 
forceInline = true)
-// Copy the left keys as class members so they could be used in next 
function call.
-val matchedKeyVars = copyKeys(ctx, leftKeyVars)
+// Copy the streamed keys as class members so they could be used in next 
function call.
+val matchedKeyVars = copyKeys(ctx, streamedKeyVars)
+
+// Handle the case when streamed rows has any NULL keys.
+val handleStreamedAnyNull = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"""
+   |$streamedRow = null;
+   |continue;
+ """.stripMargin
+  case LeftOuter | RightOuter =>
+// Eagerly return streamed row.
+s"""
+   |if (!$matches.isEmpty()) {
+   |  $matches.clear();
+   |}
+   |return false;

Review comment:
   I see. Could you leave some comments about it there?







[GitHub] [spark] SparkQA removed a comment on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


SparkQA removed a comment on pull request #32487:
URL: https://github.com/apache/spark/pull/32487#issuecomment-835906899


   **[Test build #138310 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138310/testReport)**
 for PR 32487 at commit 
[`2d27589`](https://github.com/apache/spark/commit/2d275891147341ef233ac2082e973a0e98660832).





[GitHub] [spark] SparkQA commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


SparkQA commented on pull request #32487:
URL: https://github.com/apache/spark/pull/32487#issuecomment-835979912


   **[Test build #138310 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138310/testReport)**
 for PR 32487 at commit 
[`2d27589`](https://github.com/apache/spark/commit/2d275891147341ef233ac2082e973a0e98660832).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] maropu commented on pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


maropu commented on pull request #32476:
URL: https://github.com/apache/spark/pull/32476#issuecomment-835977883


   > @maropu - `JoinBenchmark` only has inner sort merge join cases, not left/right outer join, so this PR does not affect the benchmark results as they are. Shall we have a follow-up PR to update the join benchmark? I wanted to add some more test cases to `JoinBenchmark` as well.
   
   Ah, okay. sgtm.





[GitHub] [spark] c21 commented on pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


c21 commented on pull request #32476:
URL: https://github.com/apache/spark/pull/32476#issuecomment-835976988


   @maropu - `JoinBenchmark` only has inner sort merge join cases, not left/right outer join, so this PR does not affect the benchmark results as they are. Shall we have a follow-up PR to update the join benchmark? I wanted to add some more test cases to `JoinBenchmark` as well.





[GitHub] [spark] wangyum commented on a change in pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


wangyum commented on a change in pull request #32473:
URL: https://github.com/apache/spark/pull/32473#discussion_r628967020



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BloomFilterBenchmark.scala
##
@@ -81,8 +80,57 @@ object BloomFilterBenchmark extends SqlBasedBenchmark {
 }
   }
 
+  private def writeParquetBenchmark(): Unit = {
+withTempPath { dir =>
+  val path = dir.getCanonicalPath
+
+  runBenchmark(s"Parquet Write") {
+val benchmark = new Benchmark(s"Write ${scaleFactor}M rows", N, output 
= output)
+benchmark.addCase("Without bloom filter") { _ =>
+  df.write.mode("overwrite").parquet(path + "/withoutBF")
+}
+benchmark.addCase("With bloom filter") { _ =>
+  df.write.mode("overwrite")
+.option(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#value", true)
+.parquet(path + "/withBF")
+}
+benchmark.run()
+  }
+}
+  }
+
+  private def readParquetBenchmark(): Unit = {
+val blockSizes = Seq(512 * 1024, 1024 * 1024, 2 * 1024 * 1024, 3 * 1024 * 
1024,
+  4 * 1024 * 1024, 5 * 1024 * 1024, 6 * 1024 * 1024, 7 * 1024 * 1024,
+  8 * 1024 * 1024, 9 * 1024 * 1024, 10 * 1024 * 1024)
+for (blocksize <- blockSizes) {
+  withTempPath { dir =>
+val path = dir.getCanonicalPath
+
+df.write.option("parquet.block.size", blocksize).parquet(path + 
"/withoutBF")

Review comment:
   Could we use the same value for block size and compression size? Please 
see how we did it in 
[FilterPushdownBenchmark](https://github.com/apache/spark/blob/7158e7f986630d4f67fb49a206d408c5f4384991/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala#L61-L62).
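
   A minimal sketch of using one shared value, assuming an existing DataFrame `df` and using only the option keys already mentioned in this thread (`parquet.block.size` for the Parquet row-group size, `orc.compress.size` for the ORC compression buffer size); the shared value and output paths are illustrative:
   ```scala
   // Sketch only: write both formats with one shared size so the benchmark
   // compares like with like.
   val sharedSize = 2 * 1024 * 1024  // illustrative shared value

   df.write
     .option("orc.compress.size", sharedSize)   // ORC compression buffer size
     .orc("/tmp/bloom_filter_benchmark/orc")

   df.write
     .option("parquet.block.size", sharedSize)  // Parquet row-group size
     .parquet("/tmp/bloom_filter_benchmark/parquet")
   ```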







[GitHub] [spark] c21 commented on a change in pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


c21 commented on a change in pull request #32476:
URL: https://github.com/apache/spark/pull/32476#discussion_r628966219



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -353,12 +353,37 @@ case class SortMergeJoinExec(
 }
   }
 
-  override def supportCodegen: Boolean = {
-joinType.isInstanceOf[InnerLike]
+  private lazy val (streamedPlan, bufferedPlan) = joinType match {

Review comment:
   @maropu - yes, this is used for code-gen only. Note that here we only pattern match the inner/left outer/right outer join types, so with a plain `val` it would throw an exception for other join types.
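
   A standalone, hypothetical illustration of that point: with a plain `val` the non-exhaustive match would throw eagerly at construction time for join types that never reach code-gen, while `lazy val` defers the match until the destructured values are actually used.
   ```scala
   // Hypothetical types, only to illustrate the val vs. lazy val behaviour.
   sealed trait JoinType
   case object Inner extends JoinType
   case object LeftOuter extends JoinType
   case object FullOuter extends JoinType

   class JoinSketch(joinType: JoinType, left: String, right: String) {
     // With a plain `val`, new JoinSketch(FullOuter, ...) would throw right here;
     // with `lazy val` it only throws if streamedSide/bufferedSide are read,
     // which in SortMergeJoinExec happens only during code-gen.
     private lazy val (streamedSide, bufferedSide) = joinType match {
       case Inner | LeftOuter => (left, right)
       case x => throw new IllegalArgumentException(s"unsupported join type: $x")
     }

     def describeForCodegen(): String = s"streamed=$streamedSide, buffered=$bufferedSide"
   }

   object JoinSketch {
     def main(args: Array[String]): Unit = {
       println(new JoinSketch(LeftOuter, "left", "right").describeForCodegen())
       new JoinSketch(FullOuter, "left", "right")  // constructed fine; lazy match never runs
       println("constructed a FullOuter sketch without touching the lazy val")
     }
   }
   ```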

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -418,115 +443,140 @@ case class SortMergeJoinExec(
 // Inline mutable state since not many join operations in a task
 val matches = ctx.addMutableState(clsName, "matches",
   v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", 
forceInline = true)
-// Copy the left keys as class members so they could be used in next 
function call.
-val matchedKeyVars = copyKeys(ctx, leftKeyVars)
+// Copy the streamed keys as class members so they could be used in next 
function call.
+val matchedKeyVars = copyKeys(ctx, streamedKeyVars)
+
+// Handle the case when streamed rows has any NULL keys.
+val handleStreamedAnyNull = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"""
+   |$streamedRow = null;
+   |continue;
+ """.stripMargin
+  case LeftOuter | RightOuter =>
+// Eagerly return streamed row.
+s"""
+   |if (!$matches.isEmpty()) {
+   |  $matches.clear();
+   |}
+   |return false;

Review comment:
   I wanted to avoid calling `clear()` when `isEmpty()` is true: `ExternalAppendOnlyUnsafeRowArray.isEmpty()` is very cheap, but `clear()` sets multiple variables.

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -418,115 +443,140 @@ case class SortMergeJoinExec(
 // Inline mutable state since not many join operations in a task
 val matches = ctx.addMutableState(clsName, "matches",
   v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", 
forceInline = true)
-// Copy the left keys as class members so they could be used in next 
function call.
-val matchedKeyVars = copyKeys(ctx, leftKeyVars)
+// Copy the streamed keys as class members so they could be used in next 
function call.
+val matchedKeyVars = copyKeys(ctx, streamedKeyVars)
+
+// Handle the case when streamed rows has any NULL keys.
+val handleStreamedAnyNull = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"""
+   |$streamedRow = null;
+   |continue;
+ """.stripMargin
+  case LeftOuter | RightOuter =>
+// Eagerly return streamed row.
+s"""
+   |if (!$matches.isEmpty()) {
+   |  $matches.clear();
+   |}
+   |return false;
+ """.stripMargin
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
 
-ctx.addNewFunction("findNextInnerJoinRows",
+// Handle the case when streamed keys less than buffered keys.
+val handleStreamedLessThanBuffered = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"$streamedRow = null;"
+  case LeftOuter | RightOuter =>
+// Eagerly return with streamed row.
+"return false;"
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
+
+ctx.addNewFunction("findNextJoinRows",
   s"""
- |private boolean findNextInnerJoinRows(
- |scala.collection.Iterator leftIter,
- |scala.collection.Iterator rightIter) {
- |  $leftRow = null;
+ |private boolean findNextJoinRows(

Review comment:
   > For example, if there are too many matched duplicate rows in the 
buffered side, it seems we don't need to put all the rows in matches, right?
   
   Why don't we need to put all the rows? We need to evaluate all the rows on the buffered side for the join anyway, right?

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -418,115 +443,140 @@ case class SortMergeJoinExec(
 // Inline mutable state since not many join operations in a task
 val matches = ctx.addMutableState(clsName, "matches",
   v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", 
forceInline = true)
-// Copy the left keys as class members so they could be used in next 
function call.
-val matchedKeyVars = copyKeys(ctx, leftKeyVars)
+// Copy the 

[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

2021-05-09 Thread GitBox


wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835972741


   @dongjoon-hyun This PR only improves the `In` predicate. I have added the improvement details to the PR description.
   
   





[GitHub] [spark] maropu commented on pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


maropu commented on pull request #32476:
URL: https://github.com/apache/spark/pull/32476#issuecomment-835970308


   Could you update the `JoinBenchmark` results, too?





[GitHub] [spark] maropu commented on a change in pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

2021-05-09 Thread GitBox


maropu commented on a change in pull request #32476:
URL: https://github.com/apache/spark/pull/32476#discussion_r628960873



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -418,115 +443,140 @@ case class SortMergeJoinExec(
 // Inline mutable state since not many join operations in a task
 val matches = ctx.addMutableState(clsName, "matches",
   v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", 
forceInline = true)
-// Copy the left keys as class members so they could be used in next 
function call.
-val matchedKeyVars = copyKeys(ctx, leftKeyVars)
+// Copy the streamed keys as class members so they could be used in next 
function call.
+val matchedKeyVars = copyKeys(ctx, streamedKeyVars)
+
+// Handle the case when streamed rows has any NULL keys.
+val handleStreamedAnyNull = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"""
+   |$streamedRow = null;
+   |continue;
+ """.stripMargin
+  case LeftOuter | RightOuter =>
+// Eagerly return streamed row.
+s"""
+   |if (!$matches.isEmpty()) {
+   |  $matches.clear();
+   |}
+   |return false;
+ """.stripMargin
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
 
-ctx.addNewFunction("findNextInnerJoinRows",
+// Handle the case when streamed keys less than buffered keys.
+val handleStreamedLessThanBuffered = joinType match {
+  case _: InnerLike =>
+// Skip streamed row.
+s"$streamedRow = null;"
+  case LeftOuter | RightOuter =>
+// Eagerly return with streamed row.
+"return false;"
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.genScanner should not take $x as the JoinType")
+}
+
+ctx.addNewFunction("findNextJoinRows",
   s"""
- |private boolean findNextInnerJoinRows(
- |scala.collection.Iterator leftIter,
- |scala.collection.Iterator rightIter) {
- |  $leftRow = null;
+ |private boolean findNextJoinRows(

Review comment:
   In the outer case, is the return value unused?
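   For context, the branches quoted above boil down to one decision per join type: an inner join simply discards a streamed row that cannot match, while a left/right outer join must still emit it with nulls on the buffered side. A simplified, hypothetical Scala model of that decision (not the generated code itself, and the names below are illustrative) could look like:
   
   import org.apache.spark.sql.catalyst.plans.{InnerLike, JoinType, LeftOuter, RightOuter}
   
   sealed trait StreamedRowAction
   // Inner join: drop the streamed row and keep scanning.
   case object SkipStreamedRow extends StreamedRowAction
   // Left/right outer join: emit the streamed row with nulls on the buffered side.
   case object EmitStreamedRowUnmatched extends StreamedRowAction
   
   // Mirrors the control flow in the diff for a streamed row whose keys are NULL
   // or smaller than every remaining buffered key.
   def onStreamedRowWithoutMatch(joinType: JoinType): StreamedRowAction = joinType match {
     case _: InnerLike           => SkipStreamedRow
     case LeftOuter | RightOuter => EmitStreamedRowUnmatched
     case x => throw new IllegalArgumentException(s"Unsupported join type: $x")
   }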

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
##
@@ -554,67 +604,118 @@ case class SortMergeJoinExec(
 
   override def doProduce(ctx: CodegenContext): String = {
 // Inline mutable state since not many join operations in a task
-val leftInput = ctx.addMutableState("scala.collection.Iterator", "leftInput",
+val streamedInput = ctx.addMutableState("scala.collection.Iterator", "streamedInput",
   v => s"$v = inputs[0];", forceInline = true)
-val rightInput = ctx.addMutableState("scala.collection.Iterator", "rightInput",
+val bufferedInput = ctx.addMutableState("scala.collection.Iterator", "bufferedInput",
   v => s"$v = inputs[1];", forceInline = true)
 
-val (leftRow, matches) = genScanner(ctx)
+val (streamedRow, matches) = genScanner(ctx)
 
 // Create variables for row from both sides.
-val (leftVars, leftVarDecl) = createLeftVars(ctx, leftRow)
-val rightRow = ctx.freshName("rightRow")
-val rightVars = createRightVar(ctx, rightRow)
+val (streamedVars, streamedVarDecl) = createStreamedVars(ctx, streamedRow)
+val bufferedRow = ctx.freshName("bufferedRow")
+val bufferedVars = genBuildSideVars(ctx, bufferedRow, bufferedPlan)
 
 val iterator = ctx.freshName("iterator")
 val numOutput = metricTerm(ctx, "numOutputRows")
-val (beforeLoop, condCheck) = if (condition.isDefined) {
+val resultVars = joinType match {
+  case _: InnerLike | LeftOuter =>
+streamedVars ++ bufferedVars
+  case RightOuter =>
+bufferedVars ++ streamedVars
+  case x =>
+throw new IllegalArgumentException(
+  s"SortMergeJoin.doProduce should not take $x as the JoinType")
+}
+
+val (beforeLoop, conditionCheck) = if (condition.isDefined) {
   // Split the code of creating variables based on whether it's used by condition or not.
   val loaded = ctx.freshName("loaded")
-  val (leftBefore, leftAfter) = splitVarsByCondition(left.output, leftVars)
-  val (rightBefore, rightAfter) = splitVarsByCondition(right.output, rightVars)
+  val (streamedBefore, streamedAfter) = splitVarsByCondition(streamedOutput, streamedVars)
+  val (bufferedBefore, bufferedAfter) = splitVarsByCondition(bufferedOutput, bufferedVars)
   // Generate code for condition
-  ctx.currentVars = leftVars ++ rightVars
+  ctx.currentVars = resultVars
  val cond = BindReferences.bindReference(condition.get, output).genCode(ctx)
   // evaluate the columns those used by condition before loop
-  val before = s"""
+  val before 

[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

2021-05-09 Thread GitBox


wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628965380



##
File path: sql/core/benchmarks/FilterPushdownBenchmark-jdk11-results.txt
##
@@ -2,669 +2,669 @@
 Pushdown for many distinct value case
 

 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 0 string row (value IS NULL):             Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                       10512          10572          58         1.5         668.4       1.0X
-Parquet Vectorized (Pushdown)                              596            621          19        26.4          37.9      17.6X
-Native ORC Vectorized                                     8555           8723          97         1.8         543.9       1.2X
-Native ORC Vectorized (Pushdown)                           592            609          11        26.6          37.7      17.8X
+Parquet Vectorized                                        9788          10231         259         1.6         622.3       1.0X
+Parquet Vectorized (Pushdown)                              493            536          29        31.9          31.3      19.9X
+Native ORC Vectorized                                     6487           6575         137         2.4         412.4       1.5X
+Native ORC Vectorized (Pushdown)                           436            447          14        36.1          27.7      22.4X
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 0 string row ('7864320' < value < '7864320'):   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 --------------------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                              10406          10461          50         1.5         661.6       1.0X
-Parquet Vectorized (Pushdown)                                     619            641          22        25.4          39.4      16.8X
-Native ORC Vectorized                                            8787           8834          57         1.8         558.6       1.2X
-Native ORC Vectorized (Pushdown)                                  592            608          11        26.6          37.6      17.6X
+Parquet Vectorized                                               9861           9880          16         1.6         626.9       1.0X
+Parquet Vectorized (Pushdown)                                     507            529          21        31.0          32.3      19.4X
+Native ORC Vectorized                                            6871           6938          63         2.3         436.8       1.4X
+Native ORC Vectorized (Pushdown)                                  453            470          13        34.7          28.8      21.8X
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 1 string row (value = '7864320'):         Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                       10632          10694          60         1.5         676.0       1.0X
-Parquet Vectorized (Pushdown)                              608            635          22        25.9          38.6      17.5X
-Native ORC Vectorized                                     8790           8838          37         1.8         558.9       1.2X
-Native ORC Vectorized (Pushdown)                           559            584          22        28.1          35.5      19.0X
+Parquet Vectorized                                       10228          10471         167         1.5         650.3       1.0X
+Parquet Vectorized (Pushdown)                              511            519           5        30.8          32.5      20.0X
+Native ORC Vectorized                                     6700           6865         119         2.3         426.0       1.5X
+Native ORC Vectorized (Pushdown)                           436            454          12        36.1          27.7      23.5X
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 

[GitHub] [spark] SparkQA commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


SparkQA commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-835964738


   **[Test build #138312 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138312/testReport)**
 for PR 32473 at commit 
[`21cc2ac`](https://github.com/apache/spark/commit/21cc2ac907ffe9256942d818663ce225d1a1b992).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

2021-05-09 Thread GitBox


wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628964580



##
File path: sql/core/benchmarks/FilterPushdownBenchmark-jdk11-results.txt
##
@@ -2,669 +2,669 @@
 Pushdown for many distinct value case
 

 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 0 string row (value IS NULL):             Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                       10512          10572          58         1.5         668.4       1.0X
-Parquet Vectorized (Pushdown)                              596            621          19        26.4          37.9      17.6X
-Native ORC Vectorized                                     8555           8723          97         1.8         543.9       1.2X
-Native ORC Vectorized (Pushdown)                           592            609          11        26.6          37.7      17.8X
+Parquet Vectorized                                        9788          10231         259         1.6         622.3       1.0X
+Parquet Vectorized (Pushdown)                              493            536          29        31.9          31.3      19.9X
+Native ORC Vectorized                                     6487           6575         137         2.4         412.4       1.5X
+Native ORC Vectorized (Pushdown)                           436            447          14        36.1          27.7      22.4X

Review comment:
   No. GitHub Actions runs on different machines, so there is a performance difference between them.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #31296: [SPARK-34205][SQL][SS] Add pipe to Dataset to enable Streaming Dataset pipe

2021-05-09 Thread GitBox


github-actions[bot] commented on pull request #31296:
URL: https://github.com/apache/spark/pull/31296#issuecomment-835957576


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


AmplabJenkins removed a comment on pull request #32487:
URL: https://github.com/apache/spark/pull/32487#issuecomment-835929791


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42832/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


AmplabJenkins removed a comment on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-835929789


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42833/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-835929789


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42833/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


AmplabJenkins commented on pull request #32487:
URL: https://github.com/apache/spark/pull/32487#issuecomment-835929791


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42832/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


viirya commented on a change in pull request #32487:
URL: https://github.com/apache/spark/pull/32487#discussion_r628955847



##
File path: dev/create-release/release-build.sh
##
@@ -210,6 +210,8 @@ if [[ "$1" == "package" ]]; then
 PYSPARK_VERSION=`echo "$SPARK_VERSION" |  sed -e "s/-/./" -e 
"s/SNAPSHOT/dev0/" -e "s/preview/dev/"`
 echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
 
+export MAVEN_OPTS="-Xmx12000m"

Review comment:
   okay




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


dongjoon-hyun commented on pull request #32487:
URL: https://github.com/apache/spark/pull/32487#issuecomment-835927141


   Also, cc @srowen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM

2021-05-09 Thread GitBox


dongjoon-hyun commented on a change in pull request #32487:
URL: https://github.com/apache/spark/pull/32487#discussion_r628955769



##
File path: dev/create-release/release-build.sh
##
@@ -210,6 +210,8 @@ if [[ "$1" == "package" ]]; then
 PYSPARK_VERSION=`echo "$SPARK_VERSION" |  sed -e "s/-/./" -e 
"s/SNAPSHOT/dev0/" -e "s/preview/dev/"`
 echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
 
+export MAVEN_OPTS="-Xmx12000m"

Review comment:
   Can we have this globally, outside of the `if` statement? Then it looks like we would only need a one-line addition.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


