[GitHub] [spark] cfmcgrady closed pull request #32488: [SPARK-35316][SQL] UnwrapCastInBinaryComparison support In/InSet predicate

2021-05-12 Thread GitBox


cfmcgrady closed pull request #32488:
URL: https://github.com/apache/spark/pull/32488


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #32523: [SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataFrame functions in Python APIs.

2021-05-12 Thread GitBox


HyukjinKwon closed pull request #32523:
URL: https://github.com/apache/spark/pull/32523


   





[GitHub] [spark] HyukjinKwon commented on pull request #32523: [SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataFrame functions in Python APIs.

2021-05-12 Thread GitBox


HyukjinKwon commented on pull request #32523:
URL: https://github.com/apache/spark/pull/32523#issuecomment-840324993


   Merged to master and branch-3.1.





[GitHub] [spark] SparkQA removed a comment on pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling

2021-05-12 Thread GitBox


SparkQA removed a comment on pull request #32448:
URL: https://github.com/apache/spark/pull/32448#issuecomment-840218983


   **[Test build #138481 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138481/testReport)**
 for PR 32448 at commit 
[`93b47d3`](https://github.com/apache/spark/commit/93b47d3f190369afdf5a2a5ae0ec0c6054b56c1b).





[GitHub] [spark] SparkQA commented on pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling

2021-05-12 Thread GitBox


SparkQA commented on pull request #32448:
URL: https://github.com/apache/spark/pull/32448#issuecomment-840324232


   **[Test build #138481 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138481/testReport)**
 for PR 32448 at commit 
[`93b47d3`](https://github.com/apache/spark/commit/93b47d3f190369afdf5a2a5ae0ec0c6054b56c1b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA removed a comment on pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke

2021-05-12 Thread GitBox


SparkQA removed a comment on pull request #32527:
URL: https://github.com/apache/spark/pull/32527#issuecomment-840217408


   **[Test build #138480 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138480/testReport)**
 for PR 32527 at commit 
[`2831f9c`](https://github.com/apache/spark/commit/2831f9c0b78aa21c6cc906370fb5069e166dbf39).





[GitHub] [spark] SparkQA commented on pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke

2021-05-12 Thread GitBox


SparkQA commented on pull request #32527:
URL: https://github.com/apache/spark/pull/32527#issuecomment-840322575


   **[Test build #138480 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138480/testReport)**
 for PR 32527 at commit 
[`2831f9c`](https://github.com/apache/spark/commit/2831f9c0b78aa21c6cc906370fb5069e166dbf39).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

2021-05-12 Thread GitBox


SparkQA commented on pull request #32204:
URL: https://github.com/apache/spark/pull/32204#issuecomment-840318050


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43012/
   





[GitHub] [spark] SparkQA commented on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

2021-05-12 Thread GitBox


SparkQA commented on pull request #32204:
URL: https://github.com/apache/spark/pull/32204#issuecomment-840315107


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43012/
   





[GitHub] [spark] HyukjinKwon edited a comment on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

2021-05-12 Thread GitBox


HyukjinKwon edited a comment on pull request #32204:
URL: https://github.com/apache/spark/pull/32204#issuecomment-840312271


   @itholic:
   
   1. Please check the options **one by one** and see whether each exists and 
matches.
   2. Document general options in 
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html if 
any are missing.
   3. If you're going to do 2. separately in another PR and JIRA, don't remove 
general options from the API documentation for now.





[GitHub] [spark] HyukjinKwon edited a comment on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

2021-05-12 Thread GitBox


HyukjinKwon edited a comment on pull request #32204:
URL: https://github.com/apache/spark/pull/32204#issuecomment-840312271


   @itholic:
   
   1. Please check the options **one by one** and see whether each exists.
   2. Document general options in 
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html if 
any are missing.
   3. If you're going to do 2. separately in another PR and JIRA, don't remove 
general options from the API documentation for now.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32516:
URL: https://github.com/apache/spark/pull/32516#issuecomment-840312669


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43008/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32516:
URL: https://github.com/apache/spark/pull/32516#issuecomment-840312669


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43008/
   





[GitHub] [spark] SparkQA commented on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes

2021-05-12 Thread GitBox


SparkQA commented on pull request #32516:
URL: https://github.com/apache/spark/pull/32516#issuecomment-840312637









[GitHub] [spark] HyukjinKwon commented on pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.

2021-05-12 Thread GitBox


HyukjinKwon commented on pull request #32161:
URL: https://github.com/apache/spark/pull/32161#issuecomment-840312618


   Same comment goes here too: 
https://github.com/apache/spark/pull/32204#issuecomment-840312271





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32531:
URL: https://github.com/apache/spark/pull/32531#issuecomment-840312131


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43011/
   





[GitHub] [spark] sunchao commented on a change in pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke

2021-05-12 Thread GitBox


sunchao commented on a change in pull request #32527:
URL: https://github.com/apache/spark/pull/32527#discussion_r631576884



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
##
@@ -127,13 +128,18 @@ trait InvokeLike extends Expression with NonSQLExpression {
       arguments: Seq[Expression],
       input: InternalRow,
       dataType: DataType): Any = {
-    val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
-    if (needNullCheck && args.exists(_ == null)) {
+    var i = 0
+    val len = arguments.length
+    while (i < len) {
+      evaluatedArgs(i) = arguments(i).eval(input).asInstanceOf[Object]
+      i += 1
+    }
+    if (needNullCheck && evaluatedArgs.contains(null)) {
       // return null if one of arguments is null
       null
     } else {
       val ret = try {
-        method.invoke(obj, args: _*)
+        method.invoke(obj, evaluatedArgs: _*)
       } catch {

Review comment:
   I'm not sure we can do a similar thing in `Invoke.eval`, though, since 
`obj` in `obj.getClass.getMethod(functionName, argClasses: _*)` is different 
for each call.
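
   The pattern in the hunk above can be sketched outside Spark. This is a 
minimal illustration, not the PR's code: `Row`, `Expr`, `ReusableInvoker` and 
`ReusableArgsSketch` are hypothetical stand-ins for Catalyst's `InternalRow`, 
`Expression` and `InvokeLike`, and the sketch assumes single-threaded use. It 
shows the argument array being allocated once and refilled by a `while` loop 
on each call, while the `Method` is still resolved from the receiver's runtime 
class.

```scala
import java.lang.reflect.Method

object ReusableArgsSketch {
  // Hypothetical stand-ins for Catalyst's InternalRow and Expression.
  type Row = Map[String, Any]
  type Expr = Row => Any

  final class ReusableInvoker(arguments: IndexedSeq[Expr], needNullCheck: Boolean) {
    // Allocated once and overwritten on every call, so this class is not thread-safe.
    private val evaluatedArgs = new Array[Object](arguments.length)

    def invoke(obj: AnyRef, method: Method, input: Row): Any = {
      // Refill the shared buffer instead of building a new Seq per call.
      var i = 0
      val len = arguments.length
      while (i < len) {
        evaluatedArgs(i) = arguments(i)(input).asInstanceOf[Object]
        i += 1
      }
      if (needNullCheck && evaluatedArgs.contains(null)) {
        null // mirror the diff: any null argument short-circuits to null
      } else {
        method.invoke(obj, evaluatedArgs: _*)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val suffix: Expr = row => row.getOrElse("suffix", null)
    val invoker = new ReusableInvoker(Vector(suffix), needNullCheck = true)

    // The Method is looked up from the receiver's runtime class, as in the
    // `obj.getClass.getMethod(...)` call quoted in the review comment.
    val receiver = "spark-"
    val concat = receiver.getClass.getMethod("concat", classOf[String])

    println(invoker.invoke(receiver, concat, Map("suffix" -> "sql"))) // spark-sql
    println(invoker.invoke(receiver, concat, Map.empty))              // null
  }
}
```

   Reusing the buffer only removes the per-call allocation of the argument 
sequence. The method lookup is a separate cost: when the receiver's class can 
differ between calls, as the comment notes for `Invoke.eval`, a `Method` 
resolved once cannot simply be reused.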







[GitHub] [spark] HyukjinKwon commented on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

2021-05-12 Thread GitBox


HyukjinKwon commented on pull request #32204:
URL: https://github.com/apache/spark/pull/32204#issuecomment-840312271


   @itholic:
   
   1. Please check the options **one by one** and see whether each exists.
   2. Document general options in 
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html if 
any are missing.
   3. If you're going to do this in a separate JIRA, don't remove 
general options from the API documentation for now.





[GitHub] [spark] AmplabJenkins commented on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32531:
URL: https://github.com/apache/spark/pull/32531#issuecomment-840312131


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43011/
   





[GitHub] [spark] SparkQA commented on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file

2021-05-12 Thread GitBox


SparkQA commented on pull request #32531:
URL: https://github.com/apache/spark/pull/32531#issuecomment-840312101









[GitHub] [spark] HyukjinKwon commented on a change in pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

2021-05-12 Thread GitBox


HyukjinKwon commented on a change in pull request #32204:
URL: https://github.com/apache/spark/pull/32204#discussion_r631576139



##
File path: python/pyspark/sql/streaming.py
##
@@ -504,105 +504,15 @@ def json(self, path, schema=None, 
primitivesAsString=None, prefersDecimal=None,
 path : str
 string represents path to the JSON dataset,
 or RDD of Strings storing JSON objects.
-schema : :class:`pyspark.sql.types.StructType` or str, optional

Review comment:
   I don't think this is a general option







[GitHub] [spark] HyukjinKwon commented on a change in pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

2021-05-12 Thread GitBox


HyukjinKwon commented on a change in pull request #32204:
URL: https://github.com/apache/spark/pull/32204#discussion_r631575888



##
File path: python/pyspark/sql/readwriter.py
##
@@ -1196,39 +1097,13 @@ def json(self, path, mode=None, compression=None, 
dateFormat=None, timestampForm
 --
 path : str
 the path in any Hadoop supported file system
-mode : str, optional

Review comment:
   mode is a general option







[GitHub] [spark] AmplabJenkins removed a comment on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32498:
URL: https://github.com/apache/spark/pull/32498#issuecomment-840292938


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138477/
   





[GitHub] [spark] SparkQA commented on pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.

2021-05-12 Thread GitBox


SparkQA commented on pull request #32161:
URL: https://github.com/apache/spark/pull/32161#issuecomment-840310729


   **[Test build #138497 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138497/testReport)**
 for PR 32161 at commit 
[`bb5cd45`](https://github.com/apache/spark/commit/bb5cd4529b07b05b21cdaf878b06b61ad717be79).





[GitHub] [spark] SparkQA commented on pull request #32410: [SPARK-35286][SQL] Replace SessionState.start with SessionState.setCurrentSessionState

2021-05-12 Thread GitBox


SparkQA commented on pull request #32410:
URL: https://github.com/apache/spark/pull/32410#issuecomment-840310594


   **[Test build #138496 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138496/testReport)**
 for PR 32410 at commit 
[`4bca8ec`](https://github.com/apache/spark/commit/4bca8ecaec066ef19d04a12e134ba830320a2e0f).





[GitHub] [spark] SparkQA commented on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation

2021-05-12 Thread GitBox


SparkQA commented on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-840310493


   **[Test build #138495 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138495/testReport)**
 for PR 32494 at commit 
[`1573522`](https://github.com/apache/spark/commit/1573522541ceaf1e0b6e0eccb108b88f0fb1a4c6).





[GitHub] [spark] SparkQA commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


SparkQA commented on pull request #32498:
URL: https://github.com/apache/spark/pull/32498#issuecomment-840310425


   **[Test build #138494 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138494/testReport)**
 for PR 32498 at commit 
[`b7a6cc7`](https://github.com/apache/spark/commit/b7a6cc71110fe8de45e8c74d487ebd23b7942f34).





[GitHub] [spark] SparkQA commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader

2021-05-12 Thread GitBox


SparkQA commented on pull request #32515:
URL: https://github.com/apache/spark/pull/32515#issuecomment-840310366


   **[Test build #138493 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138493/testReport)**
 for PR 32515 at commit 
[`b8b54ea`](https://github.com/apache/spark/commit/b8b54ea9cb3bdbb8f50bdb260567dedd2af9fe1b).





[GitHub] [spark] HyukjinKwon commented on a change in pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.

2021-05-12 Thread GitBox


HyukjinKwon commented on a change in pull request #32161:
URL: https://github.com/apache/spark/pull/32161#discussion_r631575367



##
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##
@@ -812,46 +812,10 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
   /**
* Loads a Parquet file, returning the result as a `DataFrame`.
*
-   * You can set the following Parquet-specific option(s) for reading Parquet 
files:
-   * 
-   * `mergeSchema` (default is the value specified in 
`spark.sql.parquet.mergeSchema`): sets
-   * whether we should merge schemas collected from all Parquet part-files. 
This will override
-   * `spark.sql.parquet.mergeSchema`.
-   * `pathGlobFilter`: an optional glob pattern to only include files with 
paths matching
-   * the pattern. The syntax follows 
org.apache.hadoop.fs.GlobFilter.
-   * It does not change the behavior of partition discovery.
-   * `modifiedBefore` (batch only): an optional timestamp to only include 
files with
-   * modification times  occurring before the specified Time. The provided 
timestamp
-   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
-   * `modifiedAfter` (batch only): an optional timestamp to only include 
files with
-   * modification times occurring after the specified Time. The provided 
timestamp
-   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
-   * `recursiveFileLookup`: recursively scan a directory for files. Using 
this option
-   * disables partition discovery
-   * `datetimeRebaseMode` (default is the value specified in the SQL config
-   * `spark.sql.parquet.datetimeRebaseModeInRead`): the rebasing mode for the 
values
-   * of the `DATE`, `TIMESTAMP_MICROS`, `TIMESTAMP_MILLIS` logical types from 
the Julian to
-   * Proleptic Gregorian calendar:
-   *   
-   * `EXCEPTION` : Spark fails in reads of ancient dates/timestamps 
that are ambiguous
-   * between the two calendars
-   * `CORRECTED` : loading of dates/timestamps without rebasing
-   * `LEGACY` : perform rebasing of ancient dates/timestamps from the 
Julian to Proleptic
-   * Gregorian calendar
-   *   
-   * 
-   * `int96RebaseMode` (default is the value specified in the SQL config
-   * `spark.sql.parquet.int96RebaseModeInRead`): the rebasing mode for `INT96` 
timestamps
-   * from the Julian to Proleptic Gregorian calendar:
-   *   
-   * `EXCEPTION` : Spark fails in reads of ancient `INT96` timestamps 
that are ambiguous
-   * between the two calendars
-   * `CORRECTED` : loading of timestamps without rebasing
-   * `LEGACY` : perform rebasing of ancient `INT96` timestamps from 
the Julian to Proleptic
-   * Gregorian calendar
-   *   
-   * 
-   * 
+   * Parquet-specific option(s) for reading Parquet files can be found in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option">
+   *   Data Source Option</a> in the version you use.

Review comment:
   Can you add the general options here too?







[GitHub] [spark] AmplabJenkins removed a comment on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32516:
URL: https://github.com/apache/spark/pull/32516#issuecomment-840309736


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138488/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32520: [SPARK-35385][SQL][TESTS] Skip duplicate queries in the TPCDS-related tests

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32520:
URL: https://github.com/apache/spark/pull/32520#issuecomment-840309734


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138479/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32292:
URL: https://github.com/apache/spark/pull/32292#issuecomment-840309741


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43010/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-840309740


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138478/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32515:
URL: https://github.com/apache/spark/pull/32515#issuecomment-840309738


   
   Refer to this link for build results (access rights to CI server needed):
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43009/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-840309740


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138478/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32292:
URL: https://github.com/apache/spark/pull/32292#issuecomment-840309741


   
   Refer to this link for build results (access rights to CI server needed):
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43010/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32516:
URL: https://github.com/apache/spark/pull/32516#issuecomment-840309736


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138488/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32520: [SPARK-35385][SQL][TESTS] Skip duplicate queries in the TPCDS-related tests

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32520:
URL: https://github.com/apache/spark/pull/32520#issuecomment-840309734


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138479/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32515:
URL: https://github.com/apache/spark/pull/32515#issuecomment-840309738


   
   Refer to this link for build results (access rights to CI server needed):
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43009/
   





[GitHub] [spark] shahidki31 commented on a change in pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32494:
URL: https://github.com/apache/spark/pull/32494#discussion_r631574179



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/UnionEstimation.scala
##
@@ -111,6 +111,44 @@ object UnionEstimation {
   AttributeMap.empty[ColumnStat]
 }
 
+val attrToComputeNullCount = union.children.map(_.output).transpose.zipWithIndex.filter {
+  case (attrs, _) => attrs.zipWithIndex.forall {
+case (attr, childIndex) =>
+  val attrStats = union.children(childIndex).stats.attributeStats
+  attrStats.get(attr).isDefined && attrStats(attr).nullCount.isDefined
+  }
+}
+
+val newAttrStats = if (attrToComputeNullCount.nonEmpty) {
+  val outputAttrStats = new ArrayBuffer[(Attribute, ColumnStat)]()
+  attrToComputeNullCount.foreach {
+case (attrs, outputIndex) =>
+  val colWithNullStatValues = attrs.zipWithIndex.foldLeft[Option[BigInt]](None) {
+case (totalNullCount, (attr, childIndex)) =>
+  val colStat = union.children(childIndex).stats.attributeStats(attr)
+  if (totalNullCount.isDefined) {

Review comment:
   Done.







[GitHub] [spark] SparkQA commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader

2021-05-12 Thread GitBox


SparkQA commented on pull request #32515:
URL: https://github.com/apache/spark/pull/32515#issuecomment-840308059


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43009/
   





[GitHub] [spark] SparkQA commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader

2021-05-12 Thread GitBox


SparkQA commented on pull request #32515:
URL: https://github.com/apache/spark/pull/32515#issuecomment-840305304


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43009/
   





[GitHub] [spark] HyukjinKwon commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader

2021-05-12 Thread GitBox


HyukjinKwon commented on pull request #32515:
URL: https://github.com/apache/spark/pull/32515#issuecomment-840303599


   Looks okay to me too





[GitHub] [spark] SparkQA commented on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE

2021-05-12 Thread GitBox


SparkQA commented on pull request #32292:
URL: https://github.com/apache/spark/pull/32292#issuecomment-840303409









[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631566208



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -283,14 +326,17 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase {
   private def checkStats(
   plan: LogicalPlan,
   expectedStatsCboOn: Statistics,
-  expectedStatsCboOff: Statistics): Unit = {
-withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+  expectedStatsCboOff: Statistics,
+  extraConfigs: Map[String, String] = Map.empty): Unit = {
+

Review comment:
   Yes, removed the extra line







[GitHub] [spark] sunchao commented on a change in pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke

2021-05-12 Thread GitBox


sunchao commented on a change in pull request #32527:
URL: https://github.com/apache/spark/pull/32527#discussion_r631565642



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
##
@@ -127,13 +128,18 @@ trait InvokeLike extends Expression with NonSQLExpression {
   arguments: Seq[Expression],
   input: InternalRow,
   dataType: DataType): Any = {
-val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
-if (needNullCheck && args.exists(_ == null)) {
+var i = 0
+val len = arguments.length
+while (i < len) {
+  evaluatedArgs(i) = arguments(i).eval(input).asInstanceOf[Object]
+  i += 1
+}
+if (needNullCheck && evaluatedArgs.contains(null)) {
   // return null if one of arguments is null
   null
 } else {
   val ret = try {
-method.invoke(obj, args: _*)
+method.invoke(obj, evaluatedArgs: _*)
   } catch {

Review comment:
   Yeah, let me try it. In the profiling after this PR, `HashMap.get` takes 7.82% of the entire `invoke` call, so it seems worthwhile to do this.
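
   For context, a minimal, self-contained sketch of the pattern in the hunk above: arguments are evaluated with a `while` loop into a preallocated buffer instead of `arguments.map(...)`, which builds a new collection on every call. `EvalArgsSketch`, `Invoker`, and `Thunk` are hypothetical stand-ins, not Spark classes, and the reused buffer assumes single-threaded, non-reentrant use.

   ```scala
   object EvalArgsSketch {
     // Hypothetical stand-in for Expression.eval(input).
     type Thunk = () => Any

     final class Invoker(arguments: IndexedSeq[Thunk], needNullCheck: Boolean) {
       // Allocated once and overwritten on every call.
       private val evaluatedArgs = new Array[Object](arguments.length)

       def invoke(f: Array[Object] => Any): Any = {
         var i = 0
         val len = arguments.length
         while (i < len) {
           evaluatedArgs(i) = arguments(i)().asInstanceOf[Object]
           i += 1
         }
         // Return null if any argument is null and the check is enabled.
         if (needNullCheck && evaluatedArgs.contains(null)) null
         else f(evaluatedArgs)
       }
     }
   }
   ```

   Because the buffer is shared across calls, the callee must not keep a reference to it after `invoke` returns.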







[GitHub] [spark] SparkQA removed a comment on pull request #32520: [SPARK-35385][SQL][TESTS] Skip duplicate queries in the TPCDS-related tests

2021-05-12 Thread GitBox


SparkQA removed a comment on pull request #32520:
URL: https://github.com/apache/spark/pull/32520#issuecomment-840197479


   **[Test build #138479 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138479/testReport)** for PR 32520 at commit [`299abb5`](https://github.com/apache/spark/commit/299abb537bf715506d77079b65a4704a04a2829f).





[GitHub] [spark] SparkQA commented on pull request #32520: [SPARK-35385][SQL][TESTS] Skip duplicate queries in the TPCDS-related tests

2021-05-12 Thread GitBox


SparkQA commented on pull request #32520:
URL: https://github.com/apache/spark/pull/32520#issuecomment-840300886


   **[Test build #138479 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138479/testReport)** for PR 32520 at commit [`299abb5`](https://github.com/apache/spark/commit/299abb537bf715506d77079b65a4704a04a2829f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631565143



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -283,14 +326,17 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase {
   private def checkStats(
   plan: LogicalPlan,
   expectedStatsCboOn: Statistics,
-  expectedStatsCboOff: Statistics): Unit = {
-withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+  expectedStatsCboOff: Statistics,
+  extraConfigs: Map[String, String] = Map.empty): Unit = {
+

Review comment:
   I am not sure I understand you here. Do we need to put the histogram configs directly inside this method? By default, histograms are disabled and the default number of bins is 254.







[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631564790



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -77,12 +92,21 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase {
 max = Some(4),
 nullCount = Some(0),
 maxLen = Some(LongType.defaultSize),
-avgLen = Some(LongType.defaultSize))
-checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = rangeStats)
+avgLen = Some(LongType.defaultSize),
+histogram = histogram)
+val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true",
+  SQLConf.HISTOGRAM_NUM_BINS.key -> "3")
+checkStats(range, expectedStatsCboOn = rangeStats,
+  expectedStatsCboOff = rangeStats, extraConfig)
   }
 
   test("range with negative step") {
 val range = Range(-10, -20, -2, None)
+val histogramBins = new Array[HistogramBin](3)
+histogramBins(0) = HistogramBin(-18.0, -16.0, 2)
+histogramBins(1) = HistogramBin(-16.0, -12.0, 2)
+histogramBins(2) = HistogramBin(-12.0, -10.0, 1)

Review comment:
   Added an assert to check that `range.numElements` and `ndv` are the same.

##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -97,12 +121,24 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase {
 max = Some(-10),
 nullCount = Some(0),
 maxLen = Some(LongType.defaultSize),
-avgLen = Some(LongType.defaultSize))
-checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = rangeStats)
+avgLen = Some(LongType.defaultSize),
+histogram = histogram)
+val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true",
+  SQLConf.HISTOGRAM_NUM_BINS.key -> "3")
+checkStats(range, expectedStatsCboOn = rangeStats,
+  expectedStatsCboOff = rangeStats, extraConfig)
   }
 
   test("range with negative step where end minus start not divisible by step") {
+
 val range = Range(-10, -20, -3, None)
+
+val histogramBins = new Array[HistogramBin](3)
+histogramBins(0) = HistogramBin(-19.0, -16.0, 2)
+histogramBins(1) = HistogramBin(-16.0, -13.0, 1)
+histogramBins(2) = HistogramBin(-13.0, -10.0, 1)

Review comment:
   Updated
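
   To make the expected bins above easier to check by hand, here is a minimal sketch of the equi-height binning that `computeHistogramStatistics` in this PR appears to perform. `RangeHistogramSketch`, `Bin`, and `binsFor` are hypothetical names, and the sketch assumes `getRangeValue(i)` returns the i-th smallest value of the range, which the diff does not show.

   ```scala
   object RangeHistogramSketch {
     // Hypothetical stand-in for Spark's HistogramBin(lo, hi, ndv).
     final case class Bin(lo: Double, hi: Double, ndv: Long)

     def binsFor(start: Long, end: Long, step: Long, numBins: Int): Vector[Bin] = {
       // Assumption: bins are built over the range values in ascending order.
       val values = (start until end by step).toVector.sorted
       require(values.nonEmpty && numBins > 0, "sketch expects a non-empty range")
       val height = values.length.toDouble / numBins
       var lowerIndex = 0.0
       var lowerValue = values.head.toDouble
       (1 to numBins).toVector.map { i =>
         val upperIndex = i * height
         // Integer positions of the bin's upper and lower boundary values.
         val upperPos = math.ceil(upperIndex).toInt - 1
         val lowerPos = math.ceil(lowerIndex).toInt - 1
         val upperValue = values(math.max(upperPos, 0)).toDouble
         val ndv = math.max(upperPos - lowerPos, 1)
         val bin = Bin(lowerValue, upperValue, ndv.toLong)
         lowerValue = upperValue
         lowerIndex = upperIndex
         bin
       }
     }
   }
   ```

   Under these assumptions, `binsFor(-10, -20, -2, 3)` and `binsFor(-10, -20, -3, 3)` yield the bins listed in the two test hunks above.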







[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631564612



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +797,38 @@ case class Range(
 }
   }
 
+  private def computeHistogramStatistics() = {
+val numBins = conf.histogramNumBins
+val height = numElements.toDouble / numBins
+val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+val binArray = new Array[HistogramBin](numBins)
+var lowerIndex = percentileArray.head
+var lowerBinValue = getRangeValue(0)
+percentileArray.tail.zipWithIndex.foreach { case (upperIndex, binId) =>
+  // Integer index for upper and lower values in the bin.
+  val upperIndexPos = math.ceil(upperIndex).toInt - 1
+  val lowerIndexPos = math.ceil(lowerIndex).toInt - 1
+
+  val upperBinValue = getRangeValue(math.max(upperIndexPos, 0))
+  val ndv = math.max(upperIndexPos - lowerIndexPos, 1)
+  binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv)
+
+  lowerBinValue = upperBinValue
+  lowerIndex = upperIndex
+}
+Histogram(height, binArray)
+  }
+
+  // Utility method to compute histogram
+  private def getRangeValue(index: Int): Long = {

Review comment:
   Done

##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -97,12 +121,24 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase {
 max = Some(-10),
 nullCount = Some(0),
 maxLen = Some(LongType.defaultSize),
-avgLen = Some(LongType.defaultSize))
-checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = rangeStats)
+avgLen = Some(LongType.defaultSize),
+histogram = histogram)
+val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true",
+  SQLConf.HISTOGRAM_NUM_BINS.key -> "3")
+checkStats(range, expectedStatsCboOn = rangeStats,
+  expectedStatsCboOff = rangeStats, extraConfig)
   }
 
   test("range with negative step where end minus start not divisible by step") {
+

Review comment:
   Done







[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631564557



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +797,38 @@ case class Range(
 }
   }
 
+  private def computeHistogramStatistics() = {
+val numBins = conf.histogramNumBins
+val height = numElements.toDouble / numBins
+val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+val binArray = new Array[HistogramBin](numBins)
+var lowerIndex = percentileArray.head
+var lowerBinValue = getRangeValue(0)

Review comment:
   Yes, updated.







[GitHub] [spark] SparkQA removed a comment on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes

2021-05-12 Thread GitBox


SparkQA removed a comment on pull request #32516:
URL: https://github.com/apache/spark/pull/32516#issuecomment-840286547


   **[Test build #138488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138488/testReport)** for PR 32516 at commit [`702629c`](https://github.com/apache/spark/commit/702629ccead13baba006eab8a6340b49722bf60a).





[GitHub] [spark] SparkQA commented on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes

2021-05-12 Thread GitBox


SparkQA commented on pull request #32516:
URL: https://github.com/apache/spark/pull/32516#issuecomment-840298542


   **[Test build #138488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138488/testReport)** for PR 32516 at commit [`702629c`](https://github.com/apache/spark/commit/702629ccead13baba006eab8a6340b49722bf60a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] cloud-fan commented on a change in pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke

2021-05-12 Thread GitBox


cloud-fan commented on a change in pull request #32527:
URL: https://github.com/apache/spark/pull/32527#discussion_r631561074



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
##
@@ -127,13 +128,18 @@ trait InvokeLike extends Expression with NonSQLExpression {
   arguments: Seq[Expression],
   input: InternalRow,
   dataType: DataType): Any = {
-val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
-if (needNullCheck && args.exists(_ == null)) {
+var i = 0
+val len = arguments.length
+while (i < len) {
+  evaluatedArgs(i) = arguments(i).eval(input).asInstanceOf[Object]
+  i += 1
+}
+if (needNullCheck && evaluatedArgs.contains(null)) {
   // return null if one of arguments is null
   null
 } else {
   val ret = try {
-method.invoke(obj, args: _*)
+method.invoke(obj, evaluatedArgs: _*)
   } catch {

Review comment:
   We can do a similar thing in `Invoke.eval`.







[GitHub] [spark] cloud-fan commented on a change in pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke

2021-05-12 Thread GitBox


cloud-fan commented on a change in pull request #32527:
URL: https://github.com/apache/spark/pull/32527#discussion_r631560800



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
##
@@ -127,13 +128,18 @@ trait InvokeLike extends Expression with NonSQLExpression {
   arguments: Seq[Expression],
   input: InternalRow,
   dataType: DataType): Any = {
-val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
-if (needNullCheck && args.exists(_ == null)) {
+var i = 0
+val len = arguments.length
+while (i < len) {
+  evaluatedArgs(i) = arguments(i).eval(input).asInstanceOf[Object]
+  i += 1
+}
+if (needNullCheck && evaluatedArgs.contains(null)) {
   // return null if one of arguments is null
   null
 } else {
   val ret = try {
-method.invoke(obj, args: _*)
+method.invoke(obj, evaluatedArgs: _*)
   } catch {

Review comment:
   Can we also improve the last piece?
   ```
 val boxedClass = ScalaReflection.typeBoxedJavaMapping.get(dataType)
 if (boxedClass.isDefined) {
   boxedClass.get.cast(ret)
 } else {
   ret
 }
   ```
   We can create a function for it
   ```
   private lazy val boxing: Any => Any = 
ScalaReflection.typeBoxedJavaMapping.get(dataType).map(_.cast(_)).getOrElse(identity)
   ```
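
   As an illustration of the idea above, here is a self-contained sketch that resolves the map lookup once and returns a reusable function, so the per-row path no longer calls `get`. `SimpleType`, `boxedMapping`, and `boxingFor` are hypothetical stand-ins for Spark's `DataType` and `ScalaReflection.typeBoxedJavaMapping`. The lambda is spelled out because `_.cast(_)` would expand to a two-argument function.

   ```scala
   // Hypothetical stand-ins for Spark's DataType hierarchy.
   sealed trait SimpleType
   case object IntT extends SimpleType
   case object LongT extends SimpleType
   case object StringT extends SimpleType

   object BoxingSketch {
     // Hypothetical analogue of ScalaReflection.typeBoxedJavaMapping.
     val boxedMapping: Map[SimpleType, Class[_]] = Map(
       IntT -> classOf[java.lang.Integer],
       LongT -> classOf[java.lang.Long])

     // Look up the boxed class once; the returned function does no map access.
     def boxingFor(dataType: SimpleType): Any => Any =
       boxedMapping.get(dataType) match {
         case Some(cls) => (v: Any) => cls.cast(v.asInstanceOf[AnyRef])
         case None => (v: Any) => v
       }
   }
   ```

   In a class, the result of `boxingFor(dataType)` could be held in a `lazy val`, as the comment suggests, so the lookup happens at most once per expression instance.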







[GitHub] [spark] SparkQA removed a comment on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation

2021-05-12 Thread GitBox


SparkQA removed a comment on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-840190295


   **[Test build #138478 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138478/testReport)** for PR 32494 at commit [`c929124`](https://github.com/apache/spark/commit/c929124f5ce2045da43314941d513b57ce9d553a).





[GitHub] [spark] SparkQA commented on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation

2021-05-12 Thread GitBox


SparkQA commented on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-840293326


   **[Test build #138478 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138478/testReport)** for PR 32494 at commit [`c929124`](https://github.com/apache/spark/commit/c929124f5ce2045da43314941d513b57ce9d553a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] AmplabJenkins commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32498:
URL: https://github.com/apache/spark/pull/32498#issuecomment-840292938


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138477/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


SparkQA removed a comment on pull request #32498:
URL: https://github.com/apache/spark/pull/32498#issuecomment-840190243


   **[Test build #138477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138477/testReport)** for PR 32498 at commit [`0bb49b3`](https://github.com/apache/spark/commit/0bb49b3a15b4bf2c59916cce91d5aba285812079).





[GitHub] [spark] maropu commented on a change in pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation

2021-05-12 Thread GitBox


maropu commented on a change in pull request #32494:
URL: https://github.com/apache/spark/pull/32494#discussion_r631558692



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/UnionEstimation.scala
##
@@ -111,6 +111,44 @@ object UnionEstimation {
   AttributeMap.empty[ColumnStat]
 }
 
+val attrToComputeNullCount = 
union.children.map(_.output).transpose.zipWithIndex.filter {
+  case (attrs, _) => attrs.zipWithIndex.forall {
+case (attr, childIndex) =>
+  val attrStats = union.children(childIndex).stats.attributeStats
+  attrStats.get(attr).isDefined && attrStats(attr).nullCount.isDefined
+  }
+}
+
+val newAttrStats = if (attrToComputeNullCount.nonEmpty) {
+  val outputAttrStats = new ArrayBuffer[(Attribute, ColumnStat)]()
+  attrToComputeNullCount.foreach {
+case (attrs, outputIndex) =>
+  val colWithNullStatValues = 
attrs.zipWithIndex.foldLeft[Option[BigInt]](None) {
+case (totalNullCount, (attr, childIndex)) =>
+  val colStat = 
union.children(childIndex).stats.attributeStats(attr)
+  if (totalNullCount.isDefined) {

Review comment:
   If `totalNullCount` can be `None` only for the first element, could we remove this `if`?
   ```
 val firstStat = union.children.head.stats.attributeStats(attrs.head)
 val firstNullCount = firstStat.nullCount.get
 attrs.zipWithIndex.tail.foldLeft(firstNullCount) {...}
   ```
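
A minimal sketch of the suggestion above, assuming (as the preceding filter implies) that every child already has a defined `nullCount` for the column. `ChildStat` and `totalNullCount` are hypothetical stand-ins for illustration, not Spark APIs; only the seed-then-fold shape is the point.

```scala
object NullCountFoldSketch {
  // Hypothetical stand-in for a child's column statistics; only nullCount is modelled.
  final case class ChildStat(nullCount: Option[BigInt])

  // Assumes the caller has already dropped columns where any child lacks a
  // nullCount, so calling .get on each Option is safe here.
  def totalNullCount(stats: Seq[ChildStat]): BigInt = {
    require(stats.nonEmpty && stats.forall(_.nullCount.isDefined))
    // Seed the fold with the first child's count instead of None.
    val firstNullCount = stats.head.nullCount.get
    stats.tail.foldLeft(firstNullCount) { (acc, stat) =>
      acc + stat.nullCount.get
    }
  }
}
```

Seeding the fold with the first child's count removes both the `Option` accumulator and the `isDefined` branch inside the fold.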







[GitHub] [spark] SparkQA commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


SparkQA commented on pull request #32498:
URL: https://github.com/apache/spark/pull/32498#issuecomment-840292283


   **[Test build #138477 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138477/testReport)**
 for PR 32498 at commit 
[`0bb49b3`](https://github.com/apache/spark/commit/0bb49b3a15b4bf2c59916cce91d5aba285812079).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] dongjoon-hyun commented on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file

2021-05-12 Thread GitBox


dongjoon-hyun commented on pull request #32531:
URL: https://github.com/apache/spark/pull/32531#issuecomment-840291144


   Could you review this, @attilapiros ?





[GitHub] [spark] SparkQA commented on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

2021-05-12 Thread GitBox


SparkQA commented on pull request #32204:
URL: https://github.com/apache/spark/pull/32204#issuecomment-840291088


   **[Test build #138492 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138492/testReport)**
 for PR 32204 at commit 
[`a386788`](https://github.com/apache/spark/commit/a386788b44fb5255d2784ce423e3f879ba97f53c).





[GitHub] [spark] SparkQA commented on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file

2021-05-12 Thread GitBox


SparkQA commented on pull request #32531:
URL: https://github.com/apache/spark/pull/32531#issuecomment-840290823


   **[Test build #138491 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138491/testReport)**
 for PR 32531 at commit 
[`c6ce0b7`](https://github.com/apache/spark/commit/c6ce0b720c114b962e73af1a989eb113df3ec70a).





[GitHub] [spark] dongjoon-hyun opened a new pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file

2021-05-12 Thread GitBox


dongjoon-hyun opened a new pull request #32531:
URL: https://github.com/apache/spark/pull/32531


   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] itholic commented on a change in pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

2021-05-12 Thread GitBox


itholic commented on a change in pull request #32204:
URL: https://github.com/apache/spark/pull/32204#discussion_r631553255



##
File path: python/pyspark/sql/streaming.py
##
@@ -504,105 +504,13 @@ def json(self, path, schema=None, 
primitivesAsString=None, prefersDecimal=None,
 path : str
 string represents path to the JSON dataset,
 or RDD of Strings storing JSON objects.
-schema : :class:`pyspark.sql.types.StructType` or str, optional

Review comment:
   I added it to Data Source Options table!
   
   https://user-images.githubusercontent.com/44108233/118077601-62bc5f00-b3ef-11eb-9350-c62b370e167c.png
   







[GitHub] [spark] maropu commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


maropu commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631552581



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -77,12 +92,21 @@ class BasicStatsEstimationSuite extends PlanTest with 
StatsEstimationTestBase {
 max = Some(4),
 nullCount = Some(0),
 maxLen = Some(LongType.defaultSize),
-avgLen = Some(LongType.defaultSize))
-checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = 
rangeStats)
+avgLen = Some(LongType.defaultSize),
+histogram = histogram)
+val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true",
+  SQLConf.HISTOGRAM_NUM_BINS.key -> "3")
+checkStats(range, expectedStatsCboOn = rangeStats,
+  expectedStatsCboOff = rangeStats, extraConfig)
   }
 
   test("range with negative step") {
 val range = Range(-10, -20, -2, None)
+val histogramBins = new Array[HistogramBin](3)
+histogramBins(0) = HistogramBin(-18.0, -16.0, 2)
+histogramBins(1) = HistogramBin(-16.0, -12.0, 2)
+histogramBins(2) = HistogramBin(-12.0, -10.0, 1)

Review comment:
   Could you check if `range.numElements` and the total `ndv` are the same?
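
One way to run that check by hand, as a sketch: sum the distinct-value counts over the bins and compare the result with the number of elements the range produces. `Bin`, `numElements` and `totalNdv` below are hypothetical helpers that only mirror the three `HistogramBin` arguments used in the test; they are not the Spark classes, and the element-count formula assumes an exclusive end and a non-zero step.

```scala
object HistogramNdvCheck {
  // Hypothetical stand-in: lower bound, upper bound, number of distinct values.
  final case class Bin(lo: Double, hi: Double, ndv: Long)

  // Element count of a range with an exclusive end; assumes step != 0.
  def numElements(start: Long, end: Long, step: Long): Long = {
    require(step != 0, "step must be non-zero")
    val span = end - start
    // Empty when start == end or when the step points away from the end.
    if (span == 0 || (span > 0) != (step > 0)) 0L
    else (span.abs + step.abs - 1) / step.abs
  }

  // Total number of distinct values across all bins.
  def totalNdv(bins: Seq[Bin]): Long = bins.map(_.ndv).sum
}
```

For `Range(-10, -20, -2, None)` the elements are -10, -12, -14, -16 and -18, so `numElements(-10, -20, -2)` is 5, which matches the bin counts 2 + 2 + 1 in the test above.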







[GitHub] [spark] AmplabJenkins removed a comment on pull request #32199: [SPARK-35100][ML] Refactor AFT - support virtual centering

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32199:
URL: https://github.com/apache/spark/pull/32199#issuecomment-840287170


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138487/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32199: [SPARK-35100][ML] Refactor AFT - support virtual centering

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32199:
URL: https://github.com/apache/spark/pull/32199#issuecomment-840287170


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138487/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32199: [SPARK-35100][ML] Refactor AFT - support virtual centering

2021-05-12 Thread GitBox


SparkQA removed a comment on pull request #32199:
URL: https://github.com/apache/spark/pull/32199#issuecomment-840264912


   **[Test build #138487 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138487/testReport)**
 for PR 32199 at commit 
[`6ac4590`](https://github.com/apache/spark/commit/6ac459047c8168f750fe483606c62fc85b7effec).





[GitHub] [spark] maropu commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


maropu commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631552121



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -283,14 +326,17 @@ class BasicStatsEstimationSuite extends PlanTest with 
StatsEstimationTestBase {
   private def checkStats(
   plan: LogicalPlan,
   expectedStatsCboOn: Statistics,
-  expectedStatsCboOff: Statistics): Unit = {
-withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+  expectedStatsCboOff: Statistics,
+  extraConfigs: Map[String, String] = Map.empty): Unit = {
+

Review comment:
   nit: remove this unnecessary change







[GitHub] [spark] SparkQA commented on pull request #32199: [SPARK-35100][ML] Refactor AFT - support virtual centering

2021-05-12 Thread GitBox


SparkQA commented on pull request #32199:
URL: https://github.com/apache/spark/pull/32199#issuecomment-840286890


   **[Test build #138487 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138487/testReport)**
 for PR 32199 at commit 
[`6ac4590`](https://github.com/apache/spark/commit/6ac459047c8168f750fe483606c62fc85b7effec).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE

2021-05-12 Thread GitBox


SparkQA commented on pull request #32292:
URL: https://github.com/apache/spark/pull/32292#issuecomment-840286781


   **[Test build #138490 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138490/testReport)**
 for PR 32292 at commit 
[`774bda1`](https://github.com/apache/spark/commit/774bda13487ab0823e20d0295c6e7108a5a62b83).





[GitHub] [spark] maropu commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


maropu commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631552022



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -97,12 +121,24 @@ class BasicStatsEstimationSuite extends PlanTest with 
StatsEstimationTestBase {
 max = Some(-10),
 nullCount = Some(0),
 maxLen = Some(LongType.defaultSize),
-avgLen = Some(LongType.defaultSize))
-checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = 
rangeStats)
+avgLen = Some(LongType.defaultSize),
+histogram = histogram)
+val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true",
+  SQLConf.HISTOGRAM_NUM_BINS.key -> "3")
+checkStats(range, expectedStatsCboOn = rangeStats,
+  expectedStatsCboOff = rangeStats, extraConfig)
   }
 
   test("range with negative step where end minus start not divisible by step") 
{
+
 val range = Range(-10, -20, -3, None)
+
+val histogramBins = new Array[HistogramBin](3)
+histogramBins(0) = HistogramBin(-19.0, -16.0, 2)
+histogramBins(1) = HistogramBin(-16.0, -13.0, 1)
+histogramBins(2) = HistogramBin(-13.0, -10.0, 1)

Review comment:
   ditto







[GitHub] [spark] maropu commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


maropu commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631551951



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -77,12 +92,21 @@ class BasicStatsEstimationSuite extends PlanTest with 
StatsEstimationTestBase {
 max = Some(4),
 nullCount = Some(0),
 maxLen = Some(LongType.defaultSize),
-avgLen = Some(LongType.defaultSize))
-checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = 
rangeStats)
+avgLen = Some(LongType.defaultSize),
+histogram = histogram)
+val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true",
+  SQLConf.HISTOGRAM_NUM_BINS.key -> "3")
+checkStats(range, expectedStatsCboOn = rangeStats,
+  expectedStatsCboOff = rangeStats, extraConfig)
   }
 
   test("range with negative step") {
 val range = Range(-10, -20, -2, None)
+val histogramBins = new Array[HistogramBin](3)
+histogramBins(0) = HistogramBin(-18.0, -16.0, 2)
+histogramBins(1) = HistogramBin(-16.0, -12.0, 2)
+histogramBins(2) = HistogramBin(-12.0, -10.0, 1)

Review comment:
   nit:
   ```
   val histogramBins = Array(
 HistogramBin(-18.0, -16.0, 2),
 HistogramBin(-16.0, -12.0, 2),
 HistogramBin(-12.0, -10.0, 1))
   ```







[GitHub] [spark] SparkQA commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader

2021-05-12 Thread GitBox


SparkQA commented on pull request #32515:
URL: https://github.com/apache/spark/pull/32515#issuecomment-840286591


   **[Test build #138489 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138489/testReport)**
 for PR 32515 at commit 
[`ad18acc`](https://github.com/apache/spark/commit/ad18acca9e991251fa44d33f53e8c4648fbcdbb7).





[GitHub] [spark] SparkQA commented on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes

2021-05-12 Thread GitBox


SparkQA commented on pull request #32516:
URL: https://github.com/apache/spark/pull/32516#issuecomment-840286547


   **[Test build #138488 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138488/testReport)**
 for PR 32516 at commit 
[`702629c`](https://github.com/apache/spark/commit/702629ccead13baba006eab8a6340b49722bf60a).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32498:
URL: https://github.com/apache/spark/pull/32498#issuecomment-840286024


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43005/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32527:
URL: https://github.com/apache/spark/pull/32527#issuecomment-840286021


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138475/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-840286023


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43006/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32199: [SPARK-35100][ML] Refactor AFT - support virtual centering

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32199:
URL: https://github.com/apache/spark/pull/32199#issuecomment-840286022


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43007/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join

2021-05-12 Thread GitBox


AmplabJenkins removed a comment on pull request #32528:
URL: https://github.com/apache/spark/pull/32528#issuecomment-840286026


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138476/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-840286023


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43006/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32527:
URL: https://github.com/apache/spark/pull/32527#issuecomment-840286021


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138475/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32199: [SPARK-35100][ML] Refactor AFT - support virtual centering

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32199:
URL: https://github.com/apache/spark/pull/32199#issuecomment-840286022


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43007/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32528:
URL: https://github.com/apache/spark/pull/32528#issuecomment-840286026


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138476/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


AmplabJenkins commented on pull request #32498:
URL: https://github.com/apache/spark/pull/32498#issuecomment-840286024


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43005/
   





[GitHub] [spark] maropu commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


maropu commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631551107



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -97,12 +121,24 @@ class BasicStatsEstimationSuite extends PlanTest with 
StatsEstimationTestBase {
 max = Some(-10),
 nullCount = Some(0),
 maxLen = Some(LongType.defaultSize),
-avgLen = Some(LongType.defaultSize))
-checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = 
rangeStats)
+avgLen = Some(LongType.defaultSize),
+histogram = histogram)
+val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true",
+  SQLConf.HISTOGRAM_NUM_BINS.key -> "3")
+checkStats(range, expectedStatsCboOn = rangeStats,
+  expectedStatsCboOff = rangeStats, extraConfig)
   }
 
   test("range with negative step where end minus start not divisible by step") 
{
+

Review comment:
   nit: please revert this change.







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-12 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631550114



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##
@@ -598,6 +598,38 @@ abstract class SQLQuerySuiteBase extends QueryTest with 
SQLTestUtils with TestHi
 }
   }
 
+  test("SPARK-28551: CTAS Hive Table should be with non-existent or empty 
location") {
+def executeCTASWithNonEmptyLocation(tempLocation: String) {
+  sql(s"CREATE TABLE ctas1(id string) stored as rcfile LOCATION " +
+s"'file:$tempLocation/ctas1'")
+  sql("INSERT INTO TABLE ctas1 SELECT 'A' ")
+  sql(s"CREATE TABLE ctas_with_existing_location stored as rcfile " +
+s"LOCATION 'file:$tempLocation' " +
+s"AS SELECT key k, value FROM src ORDER BY k, value")
+}
+
+Seq("false", "true").foreach { convertCTASFlag =>
+  Seq("false", "true").foreach { allowNonEmptyDirFlag =>

Review comment:
   `withSQLConf` accepts `(String, String)` pairs; passing `(String, Boolean)` will fail to compile.
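
To illustrate the point about pair types, here is a minimal sketch, assuming a helper with a `(String, String)*` parameter list. `WithConfSketch`, `withConf`, and the `example.allowNonEmptyDir` key are hypothetical stand-ins for illustration, not Spark's actual `withSQLConf` or a real configuration key.

```scala
object WithConfSketch {
  // Hypothetical stand-in for a session's SQL configuration.
  private val conf = scala.collection.mutable.Map.empty[String, String]

  // Set the given pairs, run the body, then restore the previous values.
  // The varargs type is (String, String), so "key" -> true would not compile.
  def withConf[T](pairs: (String, String)*)(body: => T): T = {
    val previous = pairs.map { case (k, _) => k -> conf.get(k) }
    pairs.foreach { case (k, v) => conf(k) = v }
    try body
    finally previous.foreach {
      case (k, Some(old)) => conf(k) = old
      case (k, None) => conf.remove(k)
    }
  }

  def get(key: String): Option[String] = conf.get(key)

  // Iterating over string flags keeps every pair a (String, String).
  def flagsSeen(): Seq[String] =
    Seq("false", "true").map { flag =>
      withConf("example.allowNonEmptyDir" -> flag) {
        get("example.allowNonEmptyDir").get
      }
    }
}
```

Iterating over `Seq("false", "true")` rather than `Seq(false, true)` is what keeps each pair well typed in this sketch.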







[GitHub] [spark] SparkQA commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


SparkQA commented on pull request #32498:
URL: https://github.com/apache/spark/pull/32498#issuecomment-840283366


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43005/
   





[GitHub] [spark] SparkQA commented on pull request #32199: [SPARK-35100][ML] Refactor AFT - support virtual centering

2021-05-12 Thread GitBox


SparkQA commented on pull request #32199:
URL: https://github.com/apache/spark/pull/32199#issuecomment-840282546


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43007/
   





[GitHub] [spark] SparkQA commented on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation

2021-05-12 Thread GitBox


SparkQA commented on pull request #32494:
URL: https://github.com/apache/spark/pull/32494#issuecomment-840282473









[GitHub] [spark] gengliangwang commented on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE

2021-05-12 Thread GitBox


gengliangwang commented on pull request #32292:
URL: https://github.com/apache/spark/pull/32292#issuecomment-840281783


   > Just out of curiosity: any reason to pick try_add+try_divide instead of try_add+try_multiply?
   
   IMO, divide-by-zero errors are more common in ETL/ML jobs than integral multiplication overflow. I also talked to @bart-samwel; from his experience on BigQuery, the most desired functions are `try_cast` and `try_divide`.
   We can add `TRY_SUBTRACT`/`TRY_MULTIPLY` if many users want them. Until then, let's be cautious about adding such new functions.
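
For readers unfamiliar with the semantics under discussion, here is a plain-Scala analogy, assuming only what the thread states: `try_add` yields null on overflow and `try_divide` is meant to absorb divide-by-zero errors. `TrySemanticsSketch` and its methods are hypothetical names, and `None` stands in for SQL NULL; this is not the catalyst implementation.

```scala
import scala.util.Try

object TrySemanticsSketch {
  // Math.addExact throws ArithmeticException on Int overflow; Try turns that into None.
  def tryAdd(a: Int, b: Int): Option[Int] =
    Try(Math.addExact(a, b)).toOption

  // A zero divisor yields None instead of raising an error.
  def tryDivide(a: Double, b: Double): Option[Double] =
    if (b == 0.0) None else Some(a / b)
}
```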





[GitHub] [spark] maropu commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


maropu commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631548474



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +797,38 @@ case class Range(
 }
   }
 
+  private def computeHistogramStatistics() = {
+    val numBins = conf.histogramNumBins
+    val height = numElements.toDouble / numBins
+    val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+    val binArray = new Array[HistogramBin](numBins)
+    var lowerIndex = percentileArray.head
+    var lowerBinValue = getRangeValue(0)
+    percentileArray.tail.zipWithIndex.foreach { case (upperIndex, binId) =>
+      // Integer index for upper and lower values in the bin.
+      val upperIndexPos = math.ceil(upperIndex).toInt - 1
+      val lowerIndexPos = math.ceil(lowerIndex).toInt - 1
+
+      val upperBinValue = getRangeValue(math.max(upperIndexPos, 0))
+      val ndv = math.max(upperIndexPos - lowerIndexPos, 1)
+      binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv)
+
+      lowerBinValue = upperBinValue
+      lowerIndex = upperIndex
+    }
+    Histogram(height, binArray)
+  }
+
+  // Utility method to compute histogram
+  private def getRangeValue(index: Int): Long = {

Review comment:
   ```
   private def getRangeValue(index: Int): Long = {
     assert(index >= 0, "index must be greater than or equal to 0")
     if (step < 0) {
       // Negative step: index from the far end so values come back in ascending order.
       start + (numElements.toLong - index - 1) * step
     } else {
       start + index * step
     }
   }
   ```
   ?
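
A standalone sketch of the suggested helper follows, assuming the reversed branch is meant for a negative step (`step < 0`); that reading is an assumption here. `RangeValueSketch.rangeValue` is a hypothetical name, and `start`, `step`, and `numElements` are passed as parameters rather than read from the enclosing `Range` operator.

```scala
object RangeValueSketch {
  // Returns the index-th smallest value of the range.
  def rangeValue(start: Long, step: Long, numElements: Long, index: Int): Long = {
    require(index >= 0, "index must be greater than or equal to 0")
    if (step < 0) {
      // Descending range: count back from the last element so results ascend.
      start + (numElements - index - 1) * step
    } else {
      start + index * step
    }
  }
}
```

For the descending range 10, 8, 6, 4, 2 (start 10, step -2, five elements), index 0 maps to 2 and index 4 maps to 10, so the values are visited in ascending order just as they are for a positive step.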







[GitHub] [spark] SparkQA commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


SparkQA commented on pull request #32498:
URL: https://github.com/apache/spark/pull/32498#issuecomment-840281207


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43005/
   





[GitHub] [spark] gengliangwang commented on a change in pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE

2021-05-12 Thread GitBox


gengliangwang commented on a change in pull request #32292:
URL: https://github.com/apache/spark/pull/32292#discussion_r631546771



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
##
@@ -320,6 +320,8 @@ object FunctionRegistry {
 expression[Stack]("stack"),
 expression[CaseWhen]("when"),
 
+expression[TryAdd]("try_add"),
+expression[TryDivide]("try_divide"),

Review comment:
   Done

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TryEval.scala
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode}
+import org.apache.spark.sql.catalyst.expressions.codegen.Block._
+import org.apache.spark.sql.types.DataType
+
+private[catalyst] case class TryEval(child: Expression)

Review comment:
   Done

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TryEval.scala
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode}
+import org.apache.spark.sql.catalyst.expressions.codegen.Block._
+import org.apache.spark.sql.types.DataType
+
+private[catalyst] case class TryEval(child: Expression)
+    extends UnaryExpression with NullIntolerant {
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val childGen = child.genCode(ctx)
+    ev.copy(code = code"""
+      boolean ${ev.isNull} = true;
+      ${CodeGenerator.javaType(dataType)} ${ev.value} = ${CodeGenerator.defaultValue(dataType)};
+      try {
+        ${childGen.code}
+        ${ev.isNull} = ${childGen.isNull};
+        ${ev.value} = ${childGen.value};
+      } catch (Exception e) {
+      }"""
+    )
+  }
+
+  override def eval(input: InternalRow): Any =
+    try {
+      child.eval(input)
+    } catch {
+      case _: Exception =>
+        null
+    }
+
+  override def dataType: DataType = child.dataType
+
+  override def nullable: Boolean = true
+
+  override protected def withNewChildInternal(newChild: Expression): Expression =
+    copy(child = newChild)
+}
+
+@ExpressionDescription(
+  usage = "_FUNC_(expr1, expr2) - Returns `expr1`+`expr2` and the result is null on overflow.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(1, 2);
+       3

Review comment:
   Done

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TryEval.scala
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License 

[GitHub] [spark] maropu commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


maropu commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631546439



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +797,38 @@ case class Range(
 }
   }
 
+  private def computeHistogramStatistics() = {
+    val numBins = conf.histogramNumBins
+    val height = numElements.toDouble / numBins
+    val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+    val binArray = new Array[HistogramBin](numBins)
+    var lowerIndex = percentileArray.head
+    var lowerBinValue = getRangeValue(0)

Review comment:
   It looks like we can remove the `var`s by using `foldLeft` instead of `foreach`. Could you?
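
One way the `foldLeft` suggestion could look is sketched below, under stated assumptions: `Bin` is a hypothetical stand-in for catalyst's `HistogramBin`, and `valueAt` stands in for `getRangeValue`. The accumulator carries what the two `var`s held (the previous index and previous bin value) together with the bins built so far.

```scala
object HistogramFoldSketch {
  // Hypothetical stand-in for the catalyst HistogramBin class.
  final case class Bin(lo: Long, hi: Long, ndv: Long)

  def bins(numElements: Long, numBins: Int, valueAt: Int => Long): Vector[Bin] = {
    val height = numElements.toDouble / numBins
    // Upper percentile index of each bin: height, 2 * height, ...
    val upperIndices = (1 to numBins).map(_ * height)
    val start = (0.0, valueAt(0), Vector.empty[Bin])
    val (_, _, result) = upperIndices.foldLeft(start) {
      case ((lowerIndex, lowerValue, acc), upperIndex) =>
        // Integer positions of the upper and lower values in the bin.
        val upperPos = math.ceil(upperIndex).toInt - 1
        val lowerPos = math.ceil(lowerIndex).toInt - 1
        val upperValue = valueAt(math.max(upperPos, 0))
        val ndv = math.max(upperPos - lowerPos, 1)
        (upperIndex, upperValue, acc :+ Bin(lowerValue, upperValue, ndv.toLong))
    }
    result
  }
}
```

The per-bin arithmetic mirrors the `foreach` body in the diff; only the state threading changes.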







[GitHub] [spark] dongjoon-hyun closed pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke

2021-05-12 Thread GitBox


dongjoon-hyun closed pull request #32527:
URL: https://github.com/apache/spark/pull/32527


   





[GitHub] [spark] dongjoon-hyun commented on pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke

2021-05-12 Thread GitBox


dongjoon-hyun commented on pull request #32527:
URL: https://github.com/apache/spark/pull/32527#issuecomment-840275942


   Thank you, @sunchao and all! Merged to master for Apache Spark 3.2.0.




