[GitHub] [spark] AmplabJenkins removed a comment on pull request #24173: [SPARK-27237][SS] Introduce State schema validation among query restart

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #24173:
URL: https://github.com/apache/spark/pull/24173#issuecomment-622258664







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #24173: [SPARK-27237][SS] Introduce State schema validation among query restart

2020-04-30 Thread GitBox


AmplabJenkins commented on pull request #24173:
URL: https://github.com/apache/spark/pull/24173#issuecomment-622258664










[GitHub] [spark] SparkQA commented on pull request #24173: [SPARK-27237][SS] Introduce State schema validation among query restart

2020-04-30 Thread GitBox


SparkQA commented on pull request #24173:
URL: https://github.com/apache/spark/pull/24173#issuecomment-622258120


   **[Test build #122152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122152/testReport)** for PR 24173 at commit [`1fcfff5`](https://github.com/apache/spark/commit/1fcfff5c2ca78049eb38cf4ef7c041d0005ab9b3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA removed a comment on pull request #24173: [SPARK-27237][SS] Introduce State schema validation among query restart

2020-04-30 Thread GitBox


SparkQA removed a comment on pull request #24173:
URL: https://github.com/apache/spark/pull/24173#issuecomment-622196758


   **[Test build #122152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122152/testReport)** for PR 24173 at commit [`1fcfff5`](https://github.com/apache/spark/commit/1fcfff5c2ca78049eb38cf4ef7c041d0005ab9b3).






[GitHub] [spark] maropu commented on a change in pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28427:
URL: https://github.com/apache/spark/pull/28427#discussion_r418424358



##
File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md
##
@@ -206,9 +204,6 @@
 | org.apache.spark.sql.catalyst.expressions.Pow | power | SELECT power(2, 3) | struct |
 | org.apache.spark.sql.catalyst.expressions.Quarter | quarter | SELECT quarter('2016-08-31') | struct |
 | org.apache.spark.sql.catalyst.expressions.RLike | rlike | SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*' | struct<%SystemDrive%UsersJohn RLIKE %SystemDrive%\Users.*:boolean> |
-| org.apache.spark.sql.catalyst.expressions.Rand | random | SELECT random() | struct |
-| org.apache.spark.sql.catalyst.expressions.Rand | rand | SELECT rand() | struct |
-| org.apache.spark.sql.catalyst.expressions.Randn | randn | SELECT randn() | struct |

Review comment:
   Yea, I think this golden file was generated in master, so I re-generated it in branch-3.0. Btw, do we need this test suite in branch-3.0? Reverting #28194 from branch-3.0 is another simple way to recover it, I think.








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #28383:
URL: https://github.com/apache/spark/pull/28383#issuecomment-622256704










[GitHub] [spark] AmplabJenkins commented on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


AmplabJenkins commented on pull request #28383:
URL: https://github.com/apache/spark/pull/28383#issuecomment-622256704










[GitHub] [spark] SparkQA commented on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


SparkQA commented on pull request #28383:
URL: https://github.com/apache/spark/pull/28383#issuecomment-622256412


   **[Test build #122161 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122161/testReport)** for PR 28383 at commit [`c848508`](https://github.com/apache/spark/commit/c8485086ffee7e5ca2ad7a1e24e10081070d135d).






[GitHub] [spark] sathyaprakashg commented on pull request #28222: SPARK-31447 Fix issue in ExtractIntervalPart expression

2020-04-30 Thread GitBox


sathyaprakashg commented on pull request #28222:
URL: https://github.com/apache/spark/pull/28222#issuecomment-622256424


   Thanks @cloud-fan @yaooqinn for the great discussion on this.
   
   @yaooqinn has fixed getDays in the PR below, and for getMonths we can agree on keeping the existing behavior as appropriate.
   
   https://github.com/apache/spark/pull/28396
   
   I have another quick question before we close this PR.
   
   SubtractTimestamps seems to always populate only the microseconds field of CalendarInterval:
   
   https://github.com/apache/spark/blob/b7cde42b04b21c9bfee6535199cf385855c15853/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2242
   
   Whereas SubtractDates seems to populate the months field along with the days field:
   
   https://github.com/apache/spark/blob/b7cde42b04b21c9bfee6535199cf385855c15853/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2261
   
   https://github.com/apache/spark/blob/ea525fe8c0cc6336a7ba8d98bada3198795f8aed/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L940
   
   My question is whether we should also change SubtractTimestamps to populate the months and days fields of CalendarInterval appropriately, or whether you feel the existing implementation is appropriate.
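   For a quick intuition on the difference, the JVM's own time API mirrors it: `java.time.Duration` carries a single elapsed-time quantity (as `SubtractTimestamps` does with microseconds), while `java.time.Period` splits a date difference into calendar fields (as `SubtractDates` does with months and days). A minimal Java sketch, not Spark code, with an illustrative class name:

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.Period;

public class IntervalFieldsSketch {
    public static void main(String[] args) {
        // Timestamp difference: one scalar of elapsed time, no calendar
        // fields, analogous to SubtractTimestamps filling only the
        // microseconds field of CalendarInterval.
        Duration d = Duration.between(
            LocalDateTime.of(2020, 1, 1, 0, 0),
            LocalDateTime.of(2020, 3, 1, 12, 0));
        System.out.println(d.toHours()); // 1452 (60 days * 24h + 12h)

        // Date difference: split into calendar fields, analogous to
        // SubtractDates filling the months and days fields.
        Period p = Period.between(
            LocalDate.of(2020, 1, 1),
            LocalDate.of(2020, 3, 2));
        System.out.println(p.getMonths() + " months, " + p.getDays() + " days");
    }
}
```

   A single elapsed-microseconds value cannot be converted back into exact months, since month length varies; that is the trade-off behind the question above.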






[GitHub] [spark] AmplabJenkins removed a comment on pull request #27649: [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #27649:
URL: https://github.com/apache/spark/pull/27649#issuecomment-622255774










[GitHub] [spark] AmplabJenkins commented on pull request #27649: [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch

2020-04-30 Thread GitBox


AmplabJenkins commented on pull request #27649:
URL: https://github.com/apache/spark/pull/27649#issuecomment-622255774










[GitHub] [spark] SparkQA removed a comment on pull request #27649: [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch

2020-04-30 Thread GitBox


SparkQA removed a comment on pull request #27649:
URL: https://github.com/apache/spark/pull/27649#issuecomment-622196761


   **[Test build #122151 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122151/testReport)** for PR 27649 at commit [`6406e36`](https://github.com/apache/spark/commit/6406e36eb34377983aaf113495ca16b1553317a3).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #28420:
URL: https://github.com/apache/spark/pull/28420#issuecomment-622254976










[GitHub] [spark] SparkQA commented on pull request #27649: [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch

2020-04-30 Thread GitBox


SparkQA commented on pull request #27649:
URL: https://github.com/apache/spark/pull/27649#issuecomment-622255280


   **[Test build #122151 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122151/testReport)** for PR 27649 at commit [`6406e36`](https://github.com/apache/spark/commit/6406e36eb34377983aaf113495ca16b1553317a3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins commented on pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions

2020-04-30 Thread GitBox


AmplabJenkins commented on pull request #28420:
URL: https://github.com/apache/spark/pull/28420#issuecomment-622254976










[GitHub] [spark] SparkQA commented on pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions

2020-04-30 Thread GitBox


SparkQA commented on pull request #28420:
URL: https://github.com/apache/spark/pull/28420#issuecomment-622254725


   **[Test build #122160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122160/testReport)** for PR 28420 at commit [`f6bc861`](https://github.com/apache/spark/commit/f6bc8611e04255c575799bd31effeadee13d0f8f).






[GitHub] [spark] beliefer commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.

2020-04-30 Thread GitBox


beliefer commented on a change in pull request #28194:
URL: https://github.com/apache/spark/pull/28194#discussion_r418417068



##
File path: sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala
##
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.File
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.sql.catalyst.util.{fileToString, stringToFile}
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.tags.ExtendedSQLTest
+
+// scalastyle:off line.size.limit
+/**
+ * End-to-end test cases for SQL schemas of expression examples.
+ * The golden result file is "spark/sql/core/src/test/resources/sql-functions/sql-expression-schema.md".
+ *
+ * To run the entire test suite:
+ * {{{
+ *   build/sbt "sql/test-only *ExpressionsSchemaSuite"
+ * }}}
+ *
+ * To re-generate the golden file for the entire suite, run:
+ * {{{
+ *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *ExpressionsSchemaSuite"
+ * }}}
+ *
+ * For example:
+ * {{{
+ *   ...
+ *   @ExpressionDescription(
+ *     usage = "_FUNC_(str, n) - Returns the string which repeats the given string value n times.",
+ *     examples = """
+ *       Examples:
+ *         > SELECT _FUNC_('123', 2);
+ *          123123
+ *       """,
+ *     since = "1.5.0")
+ *   case class StringRepeat(str: Expression, times: Expression)
+ *   ...
+ * }}}
+ *
+ * The format of the golden result file looks roughly like:
+ * {{{
+ *   ...
+ *   | org.apache.spark.sql.catalyst.expressions.StringRepeat | repeat | SELECT repeat('123', 2) | struct |
+ *   ...
+ * }}}
+ */
+// scalastyle:on line.size.limit
+@ExtendedSQLTest
+class ExpressionsSchemaSuite extends QueryTest with SharedSparkSession {
+
+  private val regenerateGoldenFiles: Boolean =
+    System.getenv("SPARK_GENERATE_GOLDEN_FILES") == "1"
+
+  private val baseResourcePath = {
+    // We use a path based on Spark home for 2 reasons:
+    //   1. Maven can't get correct resource directory when resources in other jars.
+    //   2. We test subclasses in the hive-thriftserver module.
+    val sparkHome = {
+      assert(sys.props.contains("spark.test.home") ||
+        sys.env.contains("SPARK_HOME"), "spark.test.home or SPARK_HOME is not set.")
+      sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME"))
+    }
+
+    java.nio.file.Paths.get(sparkHome,
+      "sql", "core", "src", "test", "resources", "sql-functions").toFile
+  }
+
+  private val resultFile = new File(baseResourcePath, "sql-expression-schema.md")
+
+  /** A single SQL query's SQL and schema. */
+  protected case class QueryOutput(
+      className: String,
+      funcName: String,
+      sql: String = "N/A",
+      schema: String = "N/A") {
+    override def toString: String = {
+      s"| $className | $funcName | $sql | $schema |"
+    }
+  }
+
+  test("Check schemas for expression examples") {
+    val exampleRe = """^(.+);\n(?s)(.+)$""".r
+    val funInfos = spark.sessionState.functionRegistry.listFunction().map { funcId =>
+      spark.sessionState.catalog.lookupFunctionInfo(funcId)
+    }
+
+    val classFunsMap = funInfos.groupBy(_.getClassName).toSeq.sortBy(_._1)
+    val outputBuffer = new ArrayBuffer[String]
+    val outputs = new ArrayBuffer[QueryOutput]
+    val missingExamples = new ArrayBuffer[String]
+
+    classFunsMap.foreach { kv =>
+      val className = kv._1
+      kv._2.foreach { funInfo =>
+        val example = funInfo.getExamples
+        val funcName = funInfo.getName.replaceAll("\\|", "")
+        if (example == "") {
+          val queryOutput = QueryOutput(className, funcName)
+          outputBuffer += queryOutput.toString
+          outputs += queryOutput
+          missingExamples += funcName
+        }
+
+        // If the expression has an 'Examples' segment, the first element is 'Examples'. Because
+        // this test case is only used to print aliases of expressions for double checking,
+        // we only need to output the first SQL and its corresponding schema.
+        // Note: We need to filter out the commands that set the parameters, such as:
+        // SET 

[GitHub] [spark] viirya commented on a change in pull request #28351: [SPARK-31500][SQL] collect_set() of BinaryType returns duplicate elements

2020-04-30 Thread GitBox


viirya commented on a change in pull request #28351:
URL: https://github.com/apache/spark/pull/28351#discussion_r418421293



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala
##
@@ -139,6 +148,32 @@ case class CollectSet(
 
   def this(child: Expression) = this(child, 0, 0)
 
+  /*
+   * SPARK-31500
+   * Array[Byte] (BinaryType) Scala equality doesn't work as expected,
+   * so the HashSet returns duplicates; we have to change the type to
+   * drop these duplicates and make collect_set work as expected for
+   * this data type.
+   */
+  override lazy val bufferElementType = child.dataType match {
+    case BinaryType => ArrayType(BinaryType)

Review comment:
   ArrayType(ByteType)?
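   The root cause noted in the quoted comment is JVM array semantics rather than anything specific to Scala's `HashSet`: arrays compare by reference, so a hash set never deduplicates two `byte[]` values with equal contents. A minimal Java sketch (class name `BinaryEqualityDemo` is illustrative, not from the PR):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BinaryEqualityDemo {
    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 3};

        // Arrays inherit Object.equals/hashCode: identity, not content.
        System.out.println(a.equals(b));         // false
        System.out.println(Arrays.equals(a, b)); // true -- content comparison

        // So a hash set keeps both copies, which is exactly the duplicate
        // behavior SPARK-31500 reports for collect_set over BinaryType.
        Set<byte[]> set = new HashSet<>();
        set.add(a);
        set.add(b);
        System.out.println(set.size()); // 2, not 1
    }
}
```

   Wrapping each binary value in a type with structural equality, as the patch does by changing the buffer element type, restores proper set semantics.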








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28164: [SPARK-31393][SQL] Show the correct alias in a more elegant way that override the prettyName

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #28164:
URL: https://github.com/apache/spark/pull/28164#issuecomment-622253287










[GitHub] [spark] AmplabJenkins commented on pull request #28164: [SPARK-31393][SQL] Show the correct alias in a more elegant way that override the prettyName

2020-04-30 Thread GitBox


AmplabJenkins commented on pull request #28164:
URL: https://github.com/apache/spark/pull/28164#issuecomment-622253287










[GitHub] [spark] maropu removed a comment on pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions

2020-04-30 Thread GitBox


maropu removed a comment on pull request #28420:
URL: https://github.com/apache/spark/pull/28420#issuecomment-622252926


   I feel having the pretty functionality in `RuntimeReplaceable` is a bit 
intrusive. Dropping the idea above is okay, but I think we need a more general 
solution for getting pretty strings outside `RuntimeReplaceable`. WDYT? 
@cloud-fan 






[GitHub] [spark] maropu commented on a change in pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28420:
URL: https://github.com/apache/spark/pull/28420#discussion_r418348605



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
##
@@ -323,6 +323,20 @@ trait RuntimeReplaceable extends UnaryExpression with Unevaluable {
   // two `RuntimeReplaceable` are considered to be semantically equal if their "child" expressions
   // are semantically equal.
   override lazy val canonicalized: Expression = child.canonicalized
+
+  override def innerChildren: Seq[Expression] = sys.error("RuntimeReplaceable must implement" +
+    " innerChildren with the original parameters")
+
+  protected val sqlStrSeparator: String = ", "
+
+  override def sql: String = RuntimeReplaceable.this.prettyName +
+    prettyChildren.map(_.sql).mkString("(", sqlStrSeparator, ")")
+
+  protected var prettyChildren: Seq[Expression] = innerChildren

Review comment:
   I think it's better not to use `var` where possible, so how about something like this?
   ```
   // RuntimeReplaceable
   def newInstance(innerChildren: Seq[Expression]): RuntimeReplaceable

   // e.g., ParseToTimestamp
   override def newInstance(innerChildren: Seq[Expression]): ParseToTimestamp = {
     innerChildren match {
       case Seq(left) => ParseToTimestamp(left, None, child)
       case Seq(left, format) => ParseToTimestamp(left, Some(format), child)
     }
   }
   ```
   Then, update the `toPrettySQL` in `utils.package`:
   ```
   def toPrettySQL(e: Expression): String = e match {
     case r: RuntimeReplaceable =>
       r.newInstance(r.innerChildren.map(usePrettyExpression)).sql
     case _ => usePrettyExpression(e).sql
   }
   ```








[GitHub] [spark] SparkQA commented on pull request #28164: [SPARK-31393][SQL] Show the correct alias in a more elegant way that override the prettyName

2020-04-30 Thread GitBox


SparkQA commented on pull request #28164:
URL: https://github.com/apache/spark/pull/28164#issuecomment-622252974


   **[Test build #122159 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122159/testReport)** for PR 28164 at commit [`f048601`](https://github.com/apache/spark/commit/f048601421d12ada878880c6a71f43b1bff3adf3).






[GitHub] [spark] HyukjinKwon commented on a change in pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28427:
URL: https://github.com/apache/spark/pull/28427#discussion_r418420692



##
File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md
##
@@ -206,9 +204,6 @@
 | org.apache.spark.sql.catalyst.expressions.Pow | power | SELECT power(2, 3) | struct |
 | org.apache.spark.sql.catalyst.expressions.Quarter | quarter | SELECT quarter('2016-08-31') | struct |
 | org.apache.spark.sql.catalyst.expressions.RLike | rlike | SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*' | struct<%SystemDrive%UsersJohn RLIKE %SystemDrive%\Users.*:boolean> |
-| org.apache.spark.sql.catalyst.expressions.Rand | random | SELECT random() | struct |
-| org.apache.spark.sql.catalyst.expressions.Rand | rand | SELECT rand() | struct |
-| org.apache.spark.sql.catalyst.expressions.Randn | randn | SELECT randn() | struct |

Review comment:
   I think @beliefer meant that https://github.com/apache/spark/pull/28392 only landed in master, and that we should keep these just as they are at https://github.com/apache/spark/pull/28194. I assume `rand(seed)` was printed out because this test was run against the latest `master` instead of `branch-3.0` at https://github.com/apache/spark/pull/28427/commits/948a54e7059b29a1a001315b0ed2d69ec4104314 (?).








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622252767


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/122155/
   Test FAILed.






[GitHub] [spark] SparkQA removed a comment on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


SparkQA removed a comment on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622231819


   **[Test build #122155 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122155/testReport)** for PR 28427 at commit [`948a54e`](https://github.com/apache/spark/commit/948a54e7059b29a1a001315b0ed2d69ec4104314).






[GitHub] [spark] AmplabJenkins commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


AmplabJenkins commented on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622252763










[GitHub] [spark] maropu commented on pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions

2020-04-30 Thread GitBox


maropu commented on pull request #28420:
URL: https://github.com/apache/spark/pull/28420#issuecomment-622252926


   I feel having the pretty functionality in `RuntimeReplaceable` is a bit 
intrusive. Dropping the idea above is okay, but I think we need a more general 
solution for getting pretty strings outside `RuntimeReplaceable`. WDYT? 
@cloud-fan 






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622252763


   Merged build finished. Test FAILed.






[GitHub] [spark] HyukjinKwon commented on a change in pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28427:
URL: https://github.com/apache/spark/pull/28427#discussion_r418420692



##
File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md
##
@@ -206,9 +204,6 @@
 | org.apache.spark.sql.catalyst.expressions.Pow | power | SELECT power(2, 3) | 
struct |
 | org.apache.spark.sql.catalyst.expressions.Quarter | quarter | SELECT 
quarter('2016-08-31') | struct |
 | org.apache.spark.sql.catalyst.expressions.RLike | rlike | SELECT 
'%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*' | 
struct<%SystemDrive%UsersJohn RLIKE %SystemDrive%\Users.*:boolean> |
-| org.apache.spark.sql.catalyst.expressions.Rand | random | SELECT random() | 
struct |
-| org.apache.spark.sql.catalyst.expressions.Rand | rand | SELECT rand() | 
struct |
-| org.apache.spark.sql.catalyst.expressions.Randn | randn | SELECT randn() | 
struct |

Review comment:
   I think @beliefer meant that https://github.com/apache/spark/pull/28392 has 
only landed in master and that we should keep these as they are at 
https://github.com/apache/spark/pull/28194. I assume this seed was printed out 
because this test was run against the latest `master` instead of `branch-3.0` 
at 
https://github.com/apache/spark/pull/28427/commits/948a54e7059b29a1a001315b0ed2d69ec4104314








[GitHub] [spark] maropu commented on a change in pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28420:
URL: https://github.com/apache/spark/pull/28420#discussion_r418420802



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
##
@@ -323,6 +323,20 @@ trait RuntimeReplaceable extends UnaryExpression with 
Unevaluable {
   // two `RuntimeReplaceable` are considered to be semantically equal if their 
"child" expressions
   // are semantically equal.
   override lazy val canonicalized: Expression = child.canonicalized
+
+  override def innerChildren: Seq[Expression] = sys.error("RuntimeReplaceable 
must implement" +
+" innerChildren with the original parameters")
+
+  protected val sqlStrSeparator: String = ", "
+
+  override def sql: String = RuntimeReplaceable.this.prettyName +
+prettyChildren.map(_.sql).mkString("(", sqlStrSeparator, ")")
+
+  protected var prettyChildren: Seq[Expression] = innerChildren

Review comment:
   I feel having the pretty functionality in `RuntimeReplaceable` is a bit 
intrusive. Dropping the idea above is okay, but I think we need a more general 
solution for getting pretty strings outside `RuntimeReplaceable`. WDYT? 
@cloud-fan 








[GitHub] [spark] SparkQA commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


SparkQA commented on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622252679


   **[Test build #122155 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122155/testReport)**
 for PR 28427 at commit 
[`948a54e`](https://github.com/apache/spark/commit/948a54e7059b29a1a001315b0ed2d69ec4104314).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622251633










[GitHub] [spark] AmplabJenkins commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


AmplabJenkins commented on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622251633










[GitHub] [spark] SparkQA commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


SparkQA commented on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622251397


   **[Test build #122158 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122158/testReport)**
 for PR 28427 at commit 
[`db16f76`](https://github.com/apache/spark/commit/db16f762318197eb82ed962bc89daf3d4749ac80).






[GitHub] [spark] HyukjinKwon commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28194:
URL: https://github.com/apache/spark/pull/28194#discussion_r418419246



##
File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md
##
@@ -0,0 +1,341 @@
+

Review comment:
   Hopefully we can improve the diff here ...








[GitHub] [spark] viirya commented on a change in pull request #28351: [SPARK-31500][SQL] collect_set() of BinaryType returns duplicate elements

2020-04-30 Thread GitBox


viirya commented on a change in pull request #28351:
URL: https://github.com/apache/spark/pull/28351#discussion_r418418880



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
##
@@ -530,6 +530,26 @@ class DataFrameAggregateSuite extends QueryTest
 )
   }
 
+  test("SPARK-31500: collect_set() of BinaryType returns duplicate elements") {
+val bytesTest1 = "test1".getBytes
+val bytesTest2 = "test2".getBytes
+val df = Seq(bytesTest1, bytesTest1, bytesTest2).toDF("a")
+val ret = df.select(collect_set($"a")).collect()
+  .map(r => r.getAs[Seq[_]](0)).head
+assert(ret.length == 2)
+
+val a = "aa".getBytes
+val b = "bb".getBytes
+val c = "cc".getBytes
+val d = "dd".getBytes
+val df1 = Seq((a, b), (a, b), (c, d))
+  .toDF("x", "y")
+  .select(struct($"x", $"y").as("a"))
+val ret1 = df1.select(collect_set($"a")).collect()
+  .map(r => r.getAs[Seq[_]](0)).head
+assert(ret1.length == 2)

Review comment:
   +1
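The root cause behind SPARK-31500 is that `byte[]` values compare by reference identity in hash-based collections, so a set-backed aggregate never collapses equal binary contents. A stdlib-only Java illustration of that pitfall (not Spark's actual `collect_set` implementation — wrapping in `ByteBuffer` is just one way to get content-based equality):

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

public class BinaryDedup {
    // Content-based distinct count: wrap each byte[] in a ByteBuffer,
    // whose equals/hashCode compare the bytes themselves.
    static int distinctByContent(byte[]... values) {
        Set<ByteBuffer> set = new HashSet<>();
        for (byte[] v : values) {
            set.add(ByteBuffer.wrap(v));
        }
        return set.size();
    }

    // Identity-based count: a raw byte[] inherits Object's identity
    // hashCode/equals, so equal contents are NOT deduplicated.
    static int distinctByIdentity(byte[]... values) {
        Set<byte[]> set = new HashSet<>();
        for (byte[] v : values) {
            set.add(v);
        }
        return set.size();
    }

    public static void main(String[] args) {
        byte[] x = "test1".getBytes();
        byte[] y = "test1".getBytes();  // same content, different object
        byte[] z = "test2".getBytes();
        System.out.println(distinctByIdentity(x, y, z)); // 3: no content dedup
        System.out.println(distinctByContent(x, y, z));  // 2: "test1" collapses
    }
}
```
The test in the diff above checks exactly this: three binary rows with one duplicate must yield two elements from `collect_set`.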








[GitHub] [spark] HyukjinKwon commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28194:
URL: https://github.com/apache/spark/pull/28194#discussion_r418418796



##
File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md
##
@@ -0,0 +1,341 @@
+

Review comment:
   Okay, but it seems more difficult to track the diff. See 
https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.0-test-sbt-hadoop-2.7-hive-2.3/431/testReport/junit/org.apache.spark.sql/ExpressionsSchemaSuite/Check_schemas_for_expression_examples/
   
   ![Screen Shot 2020-05-01 at 2 06 10 
PM](https://user-images.githubusercontent.com/6477701/80783491-f5a7a400-8bb4-11ea-9e91-880719bd472c.png)
   ![Screen Shot 2020-05-01 at 2 06 04 
PM](https://user-images.githubusercontent.com/6477701/80783494-f7716780-8bb4-11ea-9425-dfc348644649.png)
   ![Screen Shot 2020-05-01 at 2 05 58 
PM](https://user-images.githubusercontent.com/6477701/80783495-f8a29480-8bb4-11ea-8b7a-eb16cbcc9a8e.png)
   ![Screen Shot 2020-05-01 at 2 05 48 
PM](https://user-images.githubusercontent.com/6477701/80783498-f9d3c180-8bb4-11ea-82b4-6f865294a21f.png)
   ![Screen Shot 2020-05-01 at 2 05 42 
PM](https://user-images.githubusercontent.com/6477701/80783501-fa6c5800-8bb4-11ea-9ce6-f562aebf6b6c.png)
   ![Screen Shot 2020-05-01 at 2 05 35 
PM](https://user-images.githubusercontent.com/6477701/80783504-fb9d8500-8bb4-11ea-9a9a-f0e5eadcb596.png)
   
   








[GitHub] [spark] yaooqinn commented on a change in pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions

2020-04-30 Thread GitBox


yaooqinn commented on a change in pull request #28420:
URL: https://github.com/apache/spark/pull/28420#discussion_r418418394



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
##
@@ -323,6 +323,20 @@ trait RuntimeReplaceable extends UnaryExpression with 
Unevaluable {
   // two `RuntimeReplaceable` are considered to be semantically equal if their 
"child" expressions
   // are semantically equal.
   override lazy val canonicalized: Expression = child.canonicalized
+
+  override def innerChildren: Seq[Expression] = sys.error("RuntimeReplaceable 
must implement" +
+" innerChildren with the original parameters")
+
+  protected val sqlStrSeparator: String = ", "
+
+  override def sql: String = RuntimeReplaceable.this.prettyName +
+prettyChildren.map(_.sql).mkString("(", sqlStrSeparator, ")")
+
+  protected var prettyChildren: Seq[Expression] = innerChildren

Review comment:
   The `newInstance` approach does not seem robust for making a clone with 
`Seq[Expression]`.
   
   I think we can just add `def prettySQL: String` in RuntimeReplaceable for  
`utils.package` to call, and override it wherever needed without creating new 
instances.
   
   This seems like a better and simpler idea based on your suggestions. 
WDYT?
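The design being debated — an overridable pretty-printing hook instead of cloning expressions — can be sketched outside Spark. All names below (`Replaceable`, `prettySQL`, the example expressions) are illustrative stand-ins, not Spark's actual API:

```java
import java.util.List;
import java.util.stream.Collectors;

public class PrettySqlSketch {
    // A RuntimeReplaceable-like trait: the generic printer calls prettySQL(),
    // and expressions with special surface syntax override the default.
    interface Replaceable {
        String prettyName();
        List<String> childSql();

        // Default rendering: prettyName(arg1, arg2, ...)
        default String prettySQL() {
            return prettyName() + childSql().stream()
                    .collect(Collectors.joining(", ", "(", ")"));
        }
    }

    // Uses the default "func(args)" shape.
    static class Nvl implements Replaceable {
        public String prettyName() { return "nvl"; }
        public List<String> childSql() { return List.of("col", "0"); }
    }

    // Overrides prettySQL where the default shape does not fit,
    // without constructing a new expression instance.
    static class DateLiteral implements Replaceable {
        public String prettyName() { return "date"; }
        public List<String> childSql() { return List.of("'2020-04-30'"); }

        @Override
        public String prettySQL() {
            return "DATE " + childSql().get(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(new Nvl().prettySQL());
        System.out.println(new DateLiteral().prettySQL());
    }
}
```
The point of the design is that a per-expression override avoids the `newInstance`/clone machinery entirely: the printer stays generic, and only expressions with unusual syntax pay any cost.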








[GitHub] [spark] HyukjinKwon commented on a change in pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28383:
URL: https://github.com/apache/spark/pull/28383#discussion_r418417828



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuerySuite.scala
##
@@ -150,4 +150,30 @@ class OptimizeMetadataOnlyQuerySuite extends QueryTest 
with SharedSparkSession {
   }
 }
   }
+
+  test("SPARK-31590 The filter used by Metadata-only queries should not have 
Unevaluable") {
+withTable("test_tbl") {
+  withSQLConf(OPTIMIZER_METADATA_ONLY.key -> "true") {
+sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET 
PARTITIONED BY (d ,h)")

Review comment:
   Can we reuse `testMetadataOnly`?








[GitHub] [spark] beliefer commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.

2020-04-30 Thread GitBox


beliefer commented on a change in pull request #28194:
URL: https://github.com/apache/spark/pull/28194#discussion_r418417068



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala
##
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.File
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.sql.catalyst.util.{fileToString, stringToFile}
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.tags.ExtendedSQLTest
+
+// scalastyle:off line.size.limit
+/**
+ * End-to-end test cases for SQL schemas of expression examples.
+ * The golden result file is 
"spark/sql/core/src/test/resources/sql-functions/sql-expression-schema.md".
+ *
+ * To run the entire test suite:
+ * {{{
+ *   build/sbt "sql/test-only *ExpressionsSchemaSuite"
+ * }}}
+ *
+ * To re-generate golden files for entire suite, run:
+ * {{{
+ *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only 
*ExpressionsSchemaSuite"
+ * }}}
+ *
+ * For example:
+ * {{{
+ *   ...
+ *   @ExpressionDescription(
+ * usage = "_FUNC_(str, n) - Returns the string which repeats the given 
string value n times.",
+ * examples = """
+ *   Examples:
+ * > SELECT _FUNC_('123', 2);
+ *  123123
+ * """,
+ * since = "1.5.0")
+ *   case class StringRepeat(str: Expression, times: Expression)
+ *   ...
+ * }}}
+ *
+ * The format for golden result files look roughly like:
+ * {{{
+ *   ...
+ *   | org.apache.spark.sql.catalyst.expressions.StringRepeat | repeat | 
SELECT repeat('123', 2) | struct |
+ *   ...
+ * }}}
+ */
+// scalastyle:on line.size.limit
+@ExtendedSQLTest
+class ExpressionsSchemaSuite extends QueryTest with SharedSparkSession {
+
+  private val regenerateGoldenFiles: Boolean = 
System.getenv("SPARK_GENERATE_GOLDEN_FILES") == "1"
+
+  private val baseResourcePath = {
+// We use a path based on Spark home for 2 reasons:
+//   1. Maven can't get correct resource directory when resources in other 
jars.
+//   2. We test subclasses in the hive-thriftserver module.
+val sparkHome = {
+  assert(sys.props.contains("spark.test.home") ||
+sys.env.contains("SPARK_HOME"), "spark.test.home or SPARK_HOME is not 
set.")
+  sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME"))
+}
+
+java.nio.file.Paths.get(sparkHome,
+  "sql", "core", "src", "test", "resources", "sql-functions").toFile
+  }
+
+  private val resultFile = new File(baseResourcePath, 
"sql-expression-schema.md")
+
+  /** A single SQL query's SQL and schema. */
+  protected case class QueryOutput(
+  className: String,
+  funcName: String,
+  sql: String = "N/A",
+  schema: String = "N/A") {
+override def toString: String = {
+  s"| $className | $funcName | $sql | $schema |"
+}
+  }
+
+  test("Check schemas for expression examples") {
+val exampleRe = """^(.+);\n(?s)(.+)$""".r
+val funInfos = spark.sessionState.functionRegistry.listFunction().map { 
funcId =>
+  spark.sessionState.catalog.lookupFunctionInfo(funcId)
+}
+
+val classFunsMap = funInfos.groupBy(_.getClassName).toSeq.sortBy(_._1)
+val outputBuffer = new ArrayBuffer[String]
+val outputs = new ArrayBuffer[QueryOutput]
+val missingExamples = new ArrayBuffer[String]
+
+classFunsMap.foreach { kv =>
+  val className = kv._1
+  kv._2.foreach { funInfo =>
+val example = funInfo.getExamples
+val funcName = funInfo.getName.replaceAll("\\|", "")
+if (example == "") {
+  val queryOutput = QueryOutput(className, funcName)
+  outputBuffer += queryOutput.toString
+  outputs += queryOutput
+  missingExamples += funcName
+}
+
+// If expression exists 'Examples' segment, the first element is 
'Examples'. Because
+// this test case is only used to print aliases of expressions for 
double checking.
+// Therefore, we only need to output the first SQL and its 
corresponding schema.
+// Note: We need to filter out the commands that set the parameters, 
such as:
+// SET 

[GitHub] [spark] maropu commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


maropu commented on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622248576


   Ah, yes. We need to ignore them in branch-3.0. I'll update.






[GitHub] [spark] HyukjinKwon commented on a change in pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28383:
URL: https://github.com/apache/spark/pull/28383#discussion_r418415403



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuerySuite.scala
##
@@ -150,4 +150,30 @@ class OptimizeMetadataOnlyQuerySuite extends QueryTest 
with SharedSparkSession {
   }
 }
   }
+
+  test("SPARK-31590 The filter used by Metadata-only queries should not have 
Unevaluable") {
+withTable("test_tbl") {
+  withSQLConf(OPTIMIZER_METADATA_ONLY.key -> "true") {
+sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET 
PARTITIONED BY (d ,h)")

Review comment:
   Can you minimise the test case, and make it consistent with the style used 
in this file? I think you can create the partitioned table via 
the `write.parquetby` syntax instead of relying on the SQL syntax here, even when 
you create tables.








[GitHub] [spark] HyukjinKwon commented on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


HyukjinKwon commented on pull request #28383:
URL: https://github.com/apache/spark/pull/28383#issuecomment-622248593


   Looks fine to me.






[GitHub] [spark] beliefer commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


beliefer commented on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622248015


   @maropu, it seems we did not change the schema of rand() in branch-3.0? Why do 
we preserve the difference between master and branch-3.0?
   






[GitHub] [spark] HyukjinKwon commented on a change in pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28383:
URL: https://github.com/apache/spark/pull/28383#discussion_r418415403



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuerySuite.scala
##
@@ -150,4 +150,30 @@ class OptimizeMetadataOnlyQuerySuite extends QueryTest 
with SharedSparkSession {
   }
 }
   }
+
+  test("SPARK-31590 The filter used by Metadata-only queries should not have 
Unevaluable") {
+withTable("test_tbl") {
+  withSQLConf(OPTIMIZER_METADATA_ONLY.key -> "true") {
+sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET 
PARTITIONED BY (d ,h)")

Review comment:
   Can you minimise the test case, and make it consistent with the style used 
in this file? I think you can create the partitioned table via 
the `write.parquetby` syntax instead of relying on the SQL syntax here.








[GitHub] [spark] cloud-fan commented on a change in pull request #26624: [SPARK-8981][core] Add MDC support in Executor

2020-04-30 Thread GitBox


cloud-fan commented on a change in pull request #26624:
URL: https://github.com/apache/spark/pull/26624#discussion_r418414716



##
File path: core/src/main/scala/org/apache/spark/executor/Executor.scala
##
@@ -674,6 +677,18 @@ private[spark] class Executor(
 }
   }
 
+  private def setMDCForTask(taskDescription: TaskDescription): Unit = {
+val properties = taskDescription.properties
+
+org.slf4j.MDC.put("appId", properties.getProperty("spark.app.id"))

Review comment:
   And where do we set app id/name to `task.localProperties`? Do users need 
to do it manually?
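Slf4j's MDC is, at its core, a per-thread map of context keys that a log layout (e.g. `%X{appId}`) reads at logging time. A stdlib-only sketch of that mechanism — not slf4j's or Spark's actual code, and the property names are assumptions — showing what a `setMDCForTask`-style copy-in/clear-out cycle amounts to:

```java
import java.util.HashMap;
import java.util.Map;

public class MiniMdc {
    // The heart of MDC: a thread-local map of diagnostic context keys.
    private static final ThreadLocal<Map<String, String>> CTX =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { CTX.get().put(key, value); }
    static String get(String key) { return CTX.get().get(key); }
    static void clear() { CTX.get().clear(); }

    // What an executor-side setMDCForTask amounts to: copy task-local
    // properties into the logging context before the task body runs,
    // then clear so the worker thread can be reused safely.
    static String runTask(Map<String, String> taskProps, String message) {
        taskProps.forEach(MiniMdc::put);
        try {
            // A layout like "%X{appId} %m" would read the context here.
            return get("appId") + " " + message;
        } finally {
            clear();
        }
    }

    public static void main(String[] args) {
        System.out.println(runTask(Map.of("appId", "app-123"), "task started"));
    }
}
```
The clear-in-`finally` step matters on a thread pool: without it, a reused worker thread would log stale context from a previous task.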








[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28425:
URL: https://github.com/apache/spark/pull/28425#discussion_r418414748



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala
##
@@ -46,6 +47,35 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
 Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, result)
   }
 
+  /**
+   * Shorthand for calling redactString() without specifying redacting rules
+   */
+  protected def redact(text: String): String = {
+Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, text)
+  }
+
+
+  override def verboseStringWithOperatorId(): String = {
+val metaDataStr = if (scan.isInstanceOf[SupportsMetadata]) {
+  val scanWithExplainSupport: SupportsMetadata = 
scan.asInstanceOf[SupportsMetadata]
+  val metadataStr = 
scanWithExplainSupport.getMetaData.toSeq.sorted.filterNot {
+case (_, value) if (value.isEmpty || value.equals("[]")) => true
+case (_, _) => false
+  }.map {
+case (key, value) => s"$key: ${redact(value)}"
+  }

Review comment:
   nit: format
   ```
   val metaDataStr = scan match {
     case s: SupportsMetadata =>
       s.getMetaData().toSeq.sorted.flatMap {
         case (_, value) if value.isEmpty || value.equals("[]") =>
           None
         case (key, value) =>
           Some(s"$key: ${redact(value)}")
       }
     case _ =>
       Seq(scan.description())
   }
   ```








[GitHub] [spark] yaooqinn commented on a change in pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions

2020-04-30 Thread GitBox


yaooqinn commented on a change in pull request #28420:
URL: https://github.com/apache/spark/pull/28420#discussion_r418414564



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
##
@@ -323,6 +323,20 @@ trait RuntimeReplaceable extends UnaryExpression with 
Unevaluable {
   // two `RuntimeReplaceable` are considered to be semantically equal if their 
"child" expressions
   // are semantically equal.
   override lazy val canonicalized: Expression = child.canonicalized
+
+  override def innerChildren: Seq[Expression] = sys.error("RuntimeReplaceable 
must implement" +

Review comment:
   Good Point!








[GitHub] [spark] cloud-fan commented on a change in pull request #28406: [SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase

2020-04-30 Thread GitBox


cloud-fan commented on a change in pull request #28406:
URL: https://github.com/apache/spark/pull/28406#discussion_r418414225



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##
@@ -342,6 +380,43 @@ public void readLongs(
 }
   }
 
+  // A fork of `readLongs`, which rebases the timestamp long value 
(microseconds) before filling
+  // the Spark column vector.
+  public void readLongsWithRebase(
+  int total,
+  WritableColumnVector c,
+  int rowId,
+  int level,
+  VectorizedValuesReader data) throws IOException {
+int left = total;
+while (left > 0) {
+  if (this.currentCount == 0) this.readNextGroup();
+  int n = Math.min(left, this.currentCount);
+  switch (mode) {
+case RLE:
+  if (currentValue == level) {
+data.readLongsWithRebase(n, c, rowId);
+  } else {
+c.putNulls(rowId, n);
+  }
+  break;
+case PACKED:

Review comment:
   The general idea is to add an extra loop to check if we need to rebase 
or not, and it's only worthwhile if the no-rebase code path is much faster than 
the rebase code path.
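The trade-off described above can be sketched outside of Spark. This is an illustrative example only (not Spark's actual `VectorizedRleValuesReader`): the decision of whether values need rebasing is hoisted out of the per-value loop, so the common no-rebase path stays a branch-free bulk copy, while the rebase path pays per-value work. The `rebaseMicros` shift here is a placeholder, not the real Julian-to-Proleptic-Gregorian rebase.

```java
import java.util.Arrays;

public class RebaseSketch {
    // Placeholder for the real per-value rebase (Spark rebases Julian ->
    // Proleptic Gregorian microseconds); a +1 shift is used for illustration.
    static long rebaseMicros(long micros) {
        return micros + 1L;
    }

    // Fill dst[offset..offset+n) from src. The rebase check happens once per
    // run, mirroring the "extra loop" idea from the review discussion.
    static void readLongs(long[] src, long[] dst, int offset, int n, boolean rebase) {
        if (rebase) {
            for (int i = 0; i < n; i++) {
                dst[offset + i] = rebaseMicros(src[i]); // slow path: per-value work
            }
        } else {
            System.arraycopy(src, 0, dst, offset, n);   // fast path: bulk copy
        }
    }

    public static void main(String[] args) {
        long[] src = {10, 20, 30};
        long[] out = new long[3];
        readLongs(src, out, 0, 3, false);
        System.out.println(Arrays.toString(out)); // [10, 20, 30]
        readLongs(src, out, 0, 3, true);
        System.out.println(Arrays.toString(out)); // [11, 21, 31]
    }
}
```

Forking the method (rather than branching per value) is only worthwhile when the fast path is genuinely cheaper, which is the point being made above.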








[GitHub] [spark] cloud-fan commented on a change in pull request #28406: [SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase

2020-04-30 Thread GitBox


cloud-fan commented on a change in pull request #28406:
URL: https://github.com/apache/spark/pull/28406#discussion_r418414012



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##
@@ -342,6 +380,43 @@ public void readLongs(
 }
   }
 
+  // A fork of `readLongs`, which rebases the timestamp long value 
(microseconds) before filling
+  // the Spark column vector.
+  public void readLongsWithRebase(
+  int total,
+  WritableColumnVector c,
+  int rowId,
+  int level,
+  VectorizedValuesReader data) throws IOException {
+int left = total;
+while (left > 0) {
+  if (this.currentCount == 0) this.readNextGroup();
+  int n = Math.min(left, this.currentCount);
+  switch (mode) {
+case RLE:
+  if (currentValue == level) {
+data.readLongsWithRebase(n, c, rowId);
+  } else {
+c.putNulls(rowId, n);
+  }
+  break;
+case PACKED:

Review comment:
   I didn't optimize this case because the no-rebase code path doesn't look very fast: it has an `if-else` in the loop.








[GitHub] [spark] maropu commented on pull request #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-04-30 Thread GitBox


maropu commented on pull request #27233:
URL: https://github.com/apache/spark/pull/27233#issuecomment-622244872


   Could we close the JIRA, too?






[GitHub] [spark] cloud-fan commented on pull request #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-04-30 Thread GitBox


cloud-fan commented on pull request #27233:
URL: https://github.com/apache/spark/pull/27233#issuecomment-622244255


   To put a conclusion: I think this PR does fix a "correctness" issue 
according to the SQL standard. But as @tgravescs said in 
https://github.com/apache/spark/pull/27233#issuecomment-578187736 , the current 
behavior looks reasonable as well, and is the same with Oracle/SQL Server.
   
   This is a very corner case, and most likely people don't care.






[GitHub] [spark] cloud-fan edited a comment on pull request #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-04-30 Thread GitBox


cloud-fan edited a comment on pull request #27233:
URL: https://github.com/apache/spark/pull/27233#issuecomment-622244255


   To put a conclusion: I think this PR does fix a "correctness" issue 
according to the SQL standard. But as @tgravescs said in 
https://github.com/apache/spark/pull/27233#issuecomment-578187736 , the current 
behavior looks reasonable as well, and is the same with Oracle/SQL Server.
   
   This is a very corner case, and most likely people don't care.






[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28425:
URL: https://github.com/apache/spark/pull/28425#discussion_r418413096



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/connector/SupportsMetadata.scala
##
@@ -0,0 +1,21 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.internal.connector
+
+trait SupportsMetadata {

Review comment:
   Please add a comment explaining what this class is used for.








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #28383:
URL: https://github.com/apache/spark/pull/28383#issuecomment-622243652










[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28425:
URL: https://github.com/apache/spark/pull/28425#discussion_r418409102



##
File path: 
external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScan.scala
##
@@ -65,4 +65,8 @@ case class AvroScan(
   }
 
   override def hashCode(): Int = super.hashCode()
+
+  override def getMetaData(): Map[String, String] = {
+super.metaData ++ Map("Format" -> "avro")

Review comment:
   Could we move all the `Format` metadata into the `FileScan.metadata` 
side?
   ```
"Format" -> s"${this.getClass.getSimpleName.replace("Scan", 
"").toLowerCase(Locale.ROOT)}"
   ```
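The reviewer's suggestion (derive the `Format` metadata value from the scan class's simple name, e.g. `AvroScan` -> `avro`) can be sketched in plain Java. The nested `AvroScan`/`ParquetScan` classes below are hypothetical stand-ins for the real DSv2 scan implementations:

```java
import java.util.Locale;

public class FormatName {
    // Derive a format label from a scan class name, as suggested in the review:
    // strip the "Scan" suffix and lowercase the rest.
    static String formatOf(Class<?> scanClass) {
        return scanClass.getSimpleName().replace("Scan", "").toLowerCase(Locale.ROOT);
    }

    // Stand-ins for the real file-source scan classes (illustration only).
    static class AvroScan {}
    static class ParquetScan {}

    public static void main(String[] args) {
        System.out.println(formatOf(AvroScan.class));    // avro
        System.out.println(formatOf(ParquetScan.class)); // parquet
    }
}
```

Centralizing this in the base trait avoids each concrete scan hard-coding its own `"Format" -> "..."` entry.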








[GitHub] [spark] AmplabJenkins commented on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


AmplabJenkins commented on pull request #28383:
URL: https://github.com/apache/spark/pull/28383#issuecomment-622243652










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28348: [MINOR][SQL][DOCS] Remove two leading spaces from sql tables

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #28348:
URL: https://github.com/apache/spark/pull/28348#issuecomment-622243283










[GitHub] [spark] SparkQA removed a comment on pull request #28348: [MINOR][SQL][DOCS] Remove two leading spaces from sql tables

2020-04-30 Thread GitBox


SparkQA removed a comment on pull request #28348:
URL: https://github.com/apache/spark/pull/28348#issuecomment-622240171


   **[Test build #122156 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122156/testReport)**
 for PR 28348 at commit 
[`f357d36`](https://github.com/apache/spark/commit/f357d36e9fb82b4dce6719b103138e53704fd375).






[GitHub] [spark] AmplabJenkins commented on pull request #28348: [MINOR][SQL][DOCS] Remove two leading spaces from sql tables

2020-04-30 Thread GitBox


AmplabJenkins commented on pull request #28348:
URL: https://github.com/apache/spark/pull/28348#issuecomment-622243283










[GitHub] [spark] SparkQA commented on pull request #28348: [MINOR][SQL][DOCS] Remove two leading spaces from sql tables

2020-04-30 Thread GitBox


SparkQA commented on pull request #28348:
URL: https://github.com/apache/spark/pull/28348#issuecomment-622243201


   **[Test build #122156 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122156/testReport)**
 for PR 28348 at commit 
[`f357d36`](https://github.com/apache/spark/commit/f357d36e9fb82b4dce6719b103138e53704fd375).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA commented on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


SparkQA commented on pull request #28383:
URL: https://github.com/apache/spark/pull/28383#issuecomment-622243294


   **[Test build #122157 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122157/testReport)**
 for PR 28383 at commit 
[`86f28d5`](https://github.com/apache/spark/commit/86f28d51f343becc6a516f726d20efd4d12c7708).






[GitHub] [spark] cloud-fan commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.

2020-04-30 Thread GitBox


cloud-fan commented on a change in pull request #28194:
URL: https://github.com/apache/spark/pull/28194#discussion_r418412131



##
File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md
##
@@ -0,0 +1,341 @@
+

Review comment:
   See https://github.com/apache/spark/pull/28194#issuecomment-613854769
   
   I guess it's easier for humans to read the table.








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #28383:
URL: https://github.com/apache/spark/pull/28383#issuecomment-620372971


   Can one of the admins verify this patch?






[GitHub] [spark] HyukjinKwon commented on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


HyukjinKwon commented on pull request #28383:
URL: https://github.com/apache/spark/pull/28383#issuecomment-622241833


   ok to test






[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28425:
URL: https://github.com/apache/spark/pull/28425#discussion_r418410438



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##
@@ -29,11 +29,13 @@ import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjectio
 import org.apache.spark.sql.connector.read.{Batch, InputPartition, Scan, 
Statistics, SupportsReportStatistics}
 import org.apache.spark.sql.execution.PartitionedFileUtil
 import org.apache.spark.sql.execution.datasources._
+import org.apache.spark.sql.internal.connector.SupportsMetadata
 import org.apache.spark.sql.sources.Filter
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.util.Utils
 
-trait FileScan extends Scan with Batch with SupportsReportStatistics with 
Logging {
+trait FileScan extends Scan
+  with Batch with SupportsReportStatistics with Logging with SupportsMetadata {

Review comment:
   super nit: better to put `Logging` at the end? `with Batch with SupportsReportStatistics with SupportsMetadata with Logging {`. I personally think we'd better group them by similar features.








[GitHub] [spark] HyukjinKwon commented on a change in pull request #28365: [SPARK-31571][R] Overhaul stop/message/warning calls to be more canonical

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28365:
URL: https://github.com/apache/spark/pull/28365#discussion_r418410881



##
File path: R/pkg/R/utils.R
##
@@ -354,8 +354,8 @@ varargsToStrEnv <- function(...) {
   } else {
 value <- pairs[[name]]
 if (!(is.logical(value) || is.numeric(value) || is.character(value) || 
is.null(value))) {
-  stop(paste0("Unsupported type for ", name, " : ", class(value),
-   ". Supported types are logical, numeric, character and NULL."), 
call. = FALSE)
+  stop("Unsupported type for ", name, " : ", toString(class(value)), 
". ",

Review comment:
   Quick question: why do we need `toString`? Was there any difference between `stop(paste(..., ...))` and `stop(..., ...)`?








[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28425:
URL: https://github.com/apache/spark/pull/28425#discussion_r418410438



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##
@@ -29,11 +29,13 @@ import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjectio
 import org.apache.spark.sql.connector.read.{Batch, InputPartition, Scan, 
Statistics, SupportsReportStatistics}
 import org.apache.spark.sql.execution.PartitionedFileUtil
 import org.apache.spark.sql.execution.datasources._
+import org.apache.spark.sql.internal.connector.SupportsMetadata
 import org.apache.spark.sql.sources.Filter
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.util.Utils
 
-trait FileScan extends Scan with Batch with SupportsReportStatistics with 
Logging {
+trait FileScan extends Scan
+  with Batch with SupportsReportStatistics with Logging with SupportsMetadata {

Review comment:
   super nit: better to put `Logging` at the end? `with Batch with SupportsReportStatistics with SupportsMetadata with Logging {`








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28348: [MINOR][SQL][DOCS] Remove two leading spaces from sql tables

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #28348:
URL: https://github.com/apache/spark/pull/28348#issuecomment-622240424










[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28425:
URL: https://github.com/apache/spark/pull/28425#discussion_r418410243



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala
##
@@ -46,6 +47,35 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
 Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, result)
   }
 
+  /**
+   * Shorthand for calling redactString() without specifying redacting rules
+   */
+  protected def redact(text: String): String = {
+Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, text)
+  }
+
+

Review comment:
   super nit: remove the single blank line.








[GitHub] [spark] AmplabJenkins commented on pull request #28348: [MINOR][SQL][DOCS] Remove two leading spaces from sql tables

2020-04-30 Thread GitBox


AmplabJenkins commented on pull request #28348:
URL: https://github.com/apache/spark/pull/28348#issuecomment-622240424










[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28425:
URL: https://github.com/apache/spark/pull/28425#discussion_r418410056



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/connector/SupportsMetadata.scala
##
@@ -0,0 +1,21 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.internal.connector
+
+trait SupportsMetadata {
+  def getMetaData(): Map[String, String]
+}

Review comment:
   Shouldn't we move this file to the Java side, along with `Batch` and `SupportsReportStatistics`?








[GitHub] [spark] HyukjinKwon commented on a change in pull request #28365: [SPARK-31571][R] Overhaul stop/message/warning calls to be more canonical

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28365:
URL: https://github.com/apache/spark/pull/28365#discussion_r418409626



##
File path: R/pkg/R/DataFrame.R
##
@@ -2604,18 +2603,17 @@ setMethod("join",
   if (is.null(joinType)) {
 sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc)
   } else {
-if (joinType %in% c("inner", "cross",
+valid_join_types <- c("inner", "cross",

Review comment:
   nit: let's keep the naming rule here -> `validJoinTypes`








[GitHub] [spark] SparkQA commented on pull request #28348: [MINOR][SQL][DOCS] Remove two leading spaces from sql tables

2020-04-30 Thread GitBox


SparkQA commented on pull request #28348:
URL: https://github.com/apache/spark/pull/28348#issuecomment-622240171


   **[Test build #122156 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122156/testReport)**
 for PR 28348 at commit 
[`f357d36`](https://github.com/apache/spark/commit/f357d36e9fb82b4dce6719b103138e53704fd375).






[GitHub] [spark] huaxingao commented on pull request #28348: [MINOR][SQL][DOCS] Remove two leading spaces from sql tables

2020-04-30 Thread GitBox


huaxingao commented on pull request #28348:
URL: https://github.com/apache/spark/pull/28348#issuecomment-622239789


   If there are no further comments, could one of you please merge this PR for 
me? Thanks a lot!






[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28425:
URL: https://github.com/apache/spark/pull/28425#discussion_r418409102



##
File path: 
external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScan.scala
##
@@ -65,4 +65,8 @@ case class AvroScan(
   }
 
   override def hashCode(): Int = super.hashCode()
+
+  override def getMetaData(): Map[String, String] = {
+super.metaData ++ Map("Format" -> "avro")

Review comment:
   Could we move all the `Format` metadata into the trait side?
   ```
"Format" -> s"${this.getClass.getSimpleName.replace("Scan", 
"").toLowerCase(Locale.ROOT)}"
   ```








[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28425:
URL: https://github.com/apache/spark/pull/28425#discussion_r418409233



##
File path: 
external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScan.scala
##
@@ -65,4 +65,8 @@ case class AvroScan(
   }
 
   override def hashCode(): Int = super.hashCode()
+
+  override def getMetaData(): Map[String, String] = {
+super.metaData ++ Map("Format" -> "avro")

Review comment:
   Also, could we check the explain output for the Avro V2 scan?








[GitHub] [spark] HyukjinKwon commented on a change in pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28383:
URL: https://github.com/apache/spark/pull/28383#discussion_r418408328



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala
##
@@ -119,6 +119,10 @@ case class OptimizeMetadataOnlyQuery(catalog: 
SessionCatalog) extends Rule[Logic
   }
 }
 
+if 
(normalizedFilters.exists(_.find(_.isInstanceOf[Unevaluable]).isDefined)) {
+  return child
+}

Review comment:
   Why don't you just filter out subqueries consistently with the other normalized filters, via `normalizedFilters.filterNot(SubqueryExpression.hasSubquery)`?
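The alternative proposed in this review — dropping only the filters that contain subqueries, instead of bailing out of the optimization entirely — can be illustrated with a small Java sketch. The `Filter` class and `hasSubquery` flag below are a hypothetical stand-in for Catalyst expressions, not Spark's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class FilterPruning {
    // Minimal stand-in for a filter expression; hasSubquery marks filters
    // containing an unevaluable subquery (hypothetical model).
    static class Filter {
        final String expr;
        final boolean hasSubquery;
        Filter(String expr, boolean hasSubquery) {
            this.expr = expr;
            this.hasSubquery = hasSubquery;
        }
    }

    // Analogous to Scala's normalizedFilters.filterNot(SubqueryExpression.hasSubquery):
    // drop only the subquery filters and keep the rest usable for the rewrite.
    static List<Filter> pruneSubqueries(List<Filter> filters) {
        List<Filter> kept = new ArrayList<>();
        for (Filter f : filters) {
            if (!f.hasSubquery) kept.add(f);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Filter> fs = List.of(
            new Filter("id > 1", false),
            new Filter("id IN (<subquery>)", true));
        System.out.println(pruneSubqueries(fs).size()); // prints 1
    }
}
```

Whether pruning or bailing out is the right call depends on whether the remaining filters alone still make the metadata-only rewrite safe, which is the question under discussion.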








[GitHub] [spark] beliefer commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread GitBox


beliefer commented on a change in pull request #28425:
URL: https://github.com/apache/spark/pull/28425#discussion_r418405930



##
File path: sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala
##
@@ -343,6 +343,74 @@ class ExplainSuite extends ExplainSuiteHelper with 
DisableAdaptiveExecutionSuite
   assert(getNormalizedExplain(df1, FormattedMode) === 
getNormalizedExplain(df2, FormattedMode))
 }
   }
+
+  test("Explain formatted output for scan operator for datasource V2") {
+withTempDir { dir =>
+  Seq("parquet", "orc", "csv", "json").foreach { format =>
+val basePath = dir.getCanonicalPath + "/" + format
+val expected_plan_fragment =
+  s"""
+ |\\(1\\) BatchScan
+ |Output \\[2\\]: \\[value#x, id#x\\]
+ |DataFilters: \\[isnotnull\\(value#x\\), \\(value#x > 2\\)\\]
+ |Format: $format
+ |Location: InMemoryFileIndex\\[.*\\]
+ |PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\]
+ |PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]

Review comment:
   Could you extract this line into a variable?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread GitBox


beliefer commented on a change in pull request #28425:
URL: https://github.com/apache/spark/pull/28425#discussion_r418405736



##
File path: sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala
##
@@ -343,6 +343,74 @@ class ExplainSuite extends ExplainSuiteHelper with DisableAdaptiveExecutionSuite
   assert(getNormalizedExplain(df1, FormattedMode) === getNormalizedExplain(df2, FormattedMode))
 }
   }
+
+  test("Explain formatted output for scan operator for datasource V2") {
+withTempDir { dir =>
+  Seq("parquet", "orc", "csv", "json").foreach { format =>
+val basePath = dir.getCanonicalPath + "/" + format
+val expected_plan_fragment =
+  s"""
+ |\\(1\\) BatchScan
+ |Output \\[2\\]: \\[value#x, id#x\\]
+ |DataFilters: \\[isnotnull\\(value#x\\), \\(value#x > 2\\)\\]
+ |Format: $format
+ |Location: InMemoryFileIndex\\[.*\\]
+ |PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\]
+ |PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]
+ |ReadSchema: struct\\
+ |""".stripMargin.trim
+

Review comment:
   ```
   Seq(("parquet", "\\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]"),
   ("orc", "\\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]"),
   ("csv", "\\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]"),
   ("json", "")).foreach { (format, pushedFilters) =>
  val expected_plan_fragment =
  s"""
 |\\(1\\) BatchScan
 |Output \\[2\\]: \\[value#x, id#x\\]
 |DataFilters: \\[isnotnull\\(value#x\\), \\(value#x > 2\\)\\]
 |Format: $format
 |Location: InMemoryFileIndex\\[.*\\]
 |PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\]
 |PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]
 ${if (pushedFilters.nonEmpty) "|PushedFilers..."}
 |ReadSchema: struct\\
 |""".stripMargin.trim
   ```
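As written, the suggested snippet above would not compile in Scala 2: destructuring the tuple in `foreach` needs a `case`, and the interpolated `if` needs an `else` branch. A compiling sketch of the same idea follows, with placeholder patterns standing in for the suite's real regexes (hypothetical values, for illustration only):

```scala
object ExplainFragmentSketch {
  // Placeholder expected pushed-filter patterns (hypothetical, not the suite's real regexes).
  val expectations: Seq[(String, String)] = Seq(
    ("parquet", "IsNotNull(id), IsNotNull(value)"),
    ("csv", "IsNotNull(value), GreaterThan(value,2)"),
    ("json", "")  // no pushed filters for this format in the sketch
  )

  // Build the expected plan fragment, emitting the pushed-filters line only when non-empty.
  def fragment(format: String, pushedFilters: String): String = {
    val pushedLine =
      if (pushedFilters.nonEmpty) s"PushedFilters: [$pushedFilters]\n" else ""
    s"(1) BatchScan\nFormat: $format\n" + pushedLine
  }

  def main(args: Array[String]): Unit =
    expectations.foreach { case (format, pushedFilters) =>  // `case` destructures the tuple
      println(fragment(format, pushedFilters))
    }
}
```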





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28194:
URL: https://github.com/apache/spark/pull/28194#discussion_r418402122



##
File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md
##
@@ -0,0 +1,341 @@
+

Review comment:
   Why is this file md specifically?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


AmplabJenkins removed a comment on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622232100







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28194:
URL: https://github.com/apache/spark/pull/28194#discussion_r418401998



##
File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md
##
@@ -0,0 +1,341 @@
+
+## Summary

Review comment:
   Hey @cloud-fan and @gatorsmile, let's not add a generated file into the main repo if it is for documentation purposes.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


maropu commented on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622232046


   Thanks for the quick response, @HyukjinKwon ;)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.

2020-04-30 Thread GitBox


HyukjinKwon commented on a change in pull request #28194:
URL: https://github.com/apache/spark/pull/28194#discussion_r418401998



##
File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md
##
@@ -0,0 +1,341 @@
+
+## Summary

Review comment:
   Hey @cloud-fan and @gatorsmile, let's not add a generated file into the main repo.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


AmplabJenkins commented on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622232100







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


SparkQA commented on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622231819


   **[Test build #122155 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122155/testReport)** for PR 28427 at commit [`948a54e`](https://github.com/apache/spark/commit/948a54e7059b29a1a001315b0ed2d69ec4104314).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.

2020-04-30 Thread GitBox


maropu commented on a change in pull request #28194:
URL: https://github.com/apache/spark/pull/28194#discussion_r418401753



##
File path: sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala
##
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.File
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.sql.catalyst.util.{fileToString, stringToFile}
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.tags.ExtendedSQLTest
+
+// scalastyle:off line.size.limit
+/**
+ * End-to-end test cases for SQL schemas of expression examples.
+ * The golden result file is "spark/sql/core/src/test/resources/sql-functions/sql-expression-schema.md".
+ *
+ * To run the entire test suite:
+ * {{{
+ *   build/sbt "sql/test-only *ExpressionsSchemaSuite"
+ * }}}
+ *
+ * To re-generate golden files for entire suite, run:
+ * {{{
+ *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *ExpressionsSchemaSuite"
+ * }}}
+ *
+ * For example:
+ * {{{
+ *   ...
+ *   @ExpressionDescription(
+ * usage = "_FUNC_(str, n) - Returns the string which repeats the given string value n times.",
+ * examples = """
+ *   Examples:
+ * > SELECT _FUNC_('123', 2);
+ *  123123
+ * """,
+ * since = "1.5.0")
+ *   case class StringRepeat(str: Expression, times: Expression)
+ *   ...
+ * }}}
+ *
+ * The format for golden result files look roughly like:
+ * {{{
+ *   ...
+ *   | org.apache.spark.sql.catalyst.expressions.StringRepeat | repeat | SELECT repeat('123', 2) | struct |
+ *   ...
+ * }}}
+ */
+// scalastyle:on line.size.limit
+@ExtendedSQLTest
+class ExpressionsSchemaSuite extends QueryTest with SharedSparkSession {
+
+  private val regenerateGoldenFiles: Boolean = System.getenv("SPARK_GENERATE_GOLDEN_FILES") == "1"
+
+  private val baseResourcePath = {
+// We use a path based on Spark home for 2 reasons:
+//   1. Maven can't get correct resource directory when resources in other jars.
+//   2. We test subclasses in the hive-thriftserver module.
+val sparkHome = {
+  assert(sys.props.contains("spark.test.home") ||
+sys.env.contains("SPARK_HOME"), "spark.test.home or SPARK_HOME is not set.")
+  sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME"))
+}
+
+java.nio.file.Paths.get(sparkHome,
+  "sql", "core", "src", "test", "resources", "sql-functions").toFile
+  }
+
+  private val resultFile = new File(baseResourcePath, "sql-expression-schema.md")
+
+  /** A single SQL query's SQL and schema. */
+  protected case class QueryOutput(
+  className: String,
+  funcName: String,
+  sql: String = "N/A",
+  schema: String = "N/A") {
+override def toString: String = {
+  s"| $className | $funcName | $sql | $schema |"
+}
+  }
+
+  test("Check schemas for expression examples") {
+val exampleRe = """^(.+);\n(?s)(.+)$""".r
+val funInfos = spark.sessionState.functionRegistry.listFunction().map { funcId =>
+  spark.sessionState.catalog.lookupFunctionInfo(funcId)
+}
+
+val classFunsMap = funInfos.groupBy(_.getClassName).toSeq.sortBy(_._1)
+val outputBuffer = new ArrayBuffer[String]
+val outputs = new ArrayBuffer[QueryOutput]
+val missingExamples = new ArrayBuffer[String]
+
+classFunsMap.foreach { kv =>
+  val className = kv._1
+  kv._2.foreach { funInfo =>
+val example = funInfo.getExamples
+val funcName = funInfo.getName.replaceAll("\\|", "")
+if (example == "") {
+  val queryOutput = QueryOutput(className, funcName)
+  outputBuffer += queryOutput.toString
+  outputs += queryOutput
+  missingExamples += funcName
+}
+
+// If the expression has an 'Examples' segment, the first element is 'Examples'. Because
+// this test case is only used to print aliases of expressions for double checking,
+// we only need to output the first SQL and its corresponding schema.
+// Note: We need to filter out the commands that set the parameters, such as:
+// SET 

[GitHub] [spark] HyukjinKwon commented on pull request #28395: [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group

2020-04-30 Thread GitBox


HyukjinKwon commented on pull request #28395:
URL: https://github.com/apache/spark/pull/28395#issuecomment-622231583


   Ah, it was already commented at https://github.com/apache/spark/pull/28194 
:-)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #28395: [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group

2020-04-30 Thread GitBox


HyukjinKwon commented on pull request #28395:
URL: https://github.com/apache/spark/pull/28395#issuecomment-622231048


   Thanks for letting me know. I will take a look too.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


maropu commented on pull request #28427:
URL: https://github.com/apache/spark/pull/28427#issuecomment-622230995


   cc: @dongjoon-hyun 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu opened a new pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite

2020-04-30 Thread GitBox


maropu opened a new pull request #28427:
URL: https://github.com/apache/spark/pull/28427


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.

2020-04-30 Thread GitBox


maropu commented on pull request #28194:
URL: https://github.com/apache/spark/pull/28194#issuecomment-622230838


   https://github.com/apache/spark/pull/28427



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.

2020-04-30 Thread GitBox


maropu commented on pull request #28194:
URL: https://github.com/apache/spark/pull/28194#issuecomment-622230902


   To recover 3.0 asap, I opened a PR to fix it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28395: [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group

2020-04-30 Thread GitBox


dongjoon-hyun edited a comment on pull request #28395:
URL: https://github.com/apache/spark/pull/28395#issuecomment-69256


   Thank you. The follow-up looks good. BTW, FYI, `branch-3.0` UT has been 
broken by another commit.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #28395: [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group

2020-04-30 Thread GitBox


dongjoon-hyun commented on pull request #28395:
URL: https://github.com/apache/spark/pull/28395#issuecomment-69256


   Thank you. The follow-up looks good. BTW, `branch-3.0` UT is broken now by 
another commit. :(



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


