[GitHub] [spark] AmplabJenkins removed a comment on pull request #24173: [SPARK-27237][SS] Introduce State schema validation among query restart
AmplabJenkins removed a comment on pull request #24173: URL: https://github.com/apache/spark/pull/24173#issuecomment-622258664 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #24173: [SPARK-27237][SS] Introduce State schema validation among query restart
AmplabJenkins commented on pull request #24173: URL: https://github.com/apache/spark/pull/24173#issuecomment-622258664
[GitHub] [spark] SparkQA commented on pull request #24173: [SPARK-27237][SS] Introduce State schema validation among query restart
SparkQA commented on pull request #24173: URL: https://github.com/apache/spark/pull/24173#issuecomment-622258120 **[Test build #122152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122152/testReport)** for PR 24173 at commit [`1fcfff5`](https://github.com/apache/spark/commit/1fcfff5c2ca78049eb38cf4ef7c041d0005ab9b3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #24173: [SPARK-27237][SS] Introduce State schema validation among query restart
SparkQA removed a comment on pull request #24173: URL: https://github.com/apache/spark/pull/24173#issuecomment-622196758 **[Test build #122152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122152/testReport)** for PR 24173 at commit [`1fcfff5`](https://github.com/apache/spark/commit/1fcfff5c2ca78049eb38cf4ef7c041d0005ab9b3).
[GitHub] [spark] maropu commented on a change in pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
maropu commented on a change in pull request #28427: URL: https://github.com/apache/spark/pull/28427#discussion_r418424358 ## File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md ## @@ -206,9 +204,6 @@ | org.apache.spark.sql.catalyst.expressions.Pow | power | SELECT power(2, 3) | struct | | org.apache.spark.sql.catalyst.expressions.Quarter | quarter | SELECT quarter('2016-08-31') | struct | | org.apache.spark.sql.catalyst.expressions.RLike | rlike | SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*' | struct<%SystemDrive%UsersJohn RLIKE %SystemDrive%\Users.*:boolean> | -| org.apache.spark.sql.catalyst.expressions.Rand | random | SELECT random() | struct | -| org.apache.spark.sql.catalyst.expressions.Rand | rand | SELECT rand() | struct | -| org.apache.spark.sql.catalyst.expressions.Randn | randn | SELECT randn() | struct | Review comment: Yea, I think this golden file was generated in master, so I re-generated it in branch-3.0. Btw, do we need this test suite in branch-3.0? Reverting #28194 from branch-3.0 is another simple solution to recover it, I think.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable
AmplabJenkins removed a comment on pull request #28383: URL: https://github.com/apache/spark/pull/28383#issuecomment-622256704
[GitHub] [spark] AmplabJenkins commented on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable
AmplabJenkins commented on pull request #28383: URL: https://github.com/apache/spark/pull/28383#issuecomment-622256704
[GitHub] [spark] SparkQA commented on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable
SparkQA commented on pull request #28383: URL: https://github.com/apache/spark/pull/28383#issuecomment-622256412 **[Test build #122161 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122161/testReport)** for PR 28383 at commit [`c848508`](https://github.com/apache/spark/commit/c8485086ffee7e5ca2ad7a1e24e10081070d135d).
[GitHub] [spark] sathyaprakashg commented on pull request #28222: SPARK-31447 Fix issue in ExtractIntervalPart expression
sathyaprakashg commented on pull request #28222: URL: https://github.com/apache/spark/pull/28222#issuecomment-622256424 Thanks @cloud-fan @yaooqinn for the great discussion on this. @yaooqinn has fixed getDays in the PR below, and for getMonths we can agree on keeping the existing behavior as appropriate. https://github.com/apache/spark/pull/28396 I have another quick question before we close this PR. SubtractTimestamps seems to always populate only the microseconds field of CalendarInterval: https://github.com/apache/spark/blob/b7cde42b04b21c9bfee6535199cf385855c15853/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2242 Whereas SubtractDates seems to populate the months field along with the days field: https://github.com/apache/spark/blob/b7cde42b04b21c9bfee6535199cf385855c15853/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2261 https://github.com/apache/spark/blob/ea525fe8c0cc6336a7ba8d98bada3198795f8aed/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L940 My question is whether we should change SubtractTimestamps to also populate the appropriate months and days fields of CalendarInterval, or whether you feel the existing implementation is appropriate.
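The asymmetry described in the comment above can be sketched outside Spark with plain `java.time` (this is an illustrative analogy, not Spark's actual implementation): subtracting two timestamps naturally yields a single exact quantity, which mirrors SubtractTimestamps filling only the microseconds field, while subtracting two dates yields calendar fields, which mirrors SubtractDates filling the months and days fields.

```scala
import java.time.{Duration, LocalDate, LocalDateTime, Period}

object IntervalSketch {
  def main(args: Array[String]): Unit = {
    // Timestamp subtraction: one exact duration, analogous to a
    // CalendarInterval with only the microseconds field set.
    val d = Duration.between(
      LocalDateTime.of(2020, 1, 1, 0, 0),
      LocalDateTime.of(2020, 3, 1, 12, 0))
    println(s"exact seconds: ${d.getSeconds}")

    // Date subtraction: calendar fields, analogous to a
    // CalendarInterval with months and days set.
    val p = Period.between(
      LocalDate.of(2020, 1, 1),
      LocalDate.of(2020, 3, 15))
    println(s"months=${p.getMonths}, days=${p.getDays}")  // months=2, days=14
  }
}
```

Whether timestamp subtraction should also be decomposed into calendar fields is exactly the design question raised above: the decomposition is ambiguous (it depends on which end you anchor the months at), whereas the microsecond count is not.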
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27649: [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch
AmplabJenkins removed a comment on pull request #27649: URL: https://github.com/apache/spark/pull/27649#issuecomment-622255774
[GitHub] [spark] AmplabJenkins commented on pull request #27649: [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch
AmplabJenkins commented on pull request #27649: URL: https://github.com/apache/spark/pull/27649#issuecomment-622255774
[GitHub] [spark] SparkQA removed a comment on pull request #27649: [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch
SparkQA removed a comment on pull request #27649: URL: https://github.com/apache/spark/pull/27649#issuecomment-622196761 **[Test build #122151 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122151/testReport)** for PR 27649 at commit [`6406e36`](https://github.com/apache/spark/commit/6406e36eb34377983aaf113495ca16b1553317a3).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
AmplabJenkins removed a comment on pull request #28420: URL: https://github.com/apache/spark/pull/28420#issuecomment-622254976
[GitHub] [spark] SparkQA commented on pull request #27649: [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch
SparkQA commented on pull request #27649: URL: https://github.com/apache/spark/pull/27649#issuecomment-622255280 **[Test build #122151 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122151/testReport)** for PR 27649 at commit [`6406e36`](https://github.com/apache/spark/commit/6406e36eb34377983aaf113495ca16b1553317a3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
AmplabJenkins commented on pull request #28420: URL: https://github.com/apache/spark/pull/28420#issuecomment-622254976
[GitHub] [spark] SparkQA commented on pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
SparkQA commented on pull request #28420: URL: https://github.com/apache/spark/pull/28420#issuecomment-622254725 **[Test build #122160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122160/testReport)** for PR 28420 at commit [`f6bc861`](https://github.com/apache/spark/commit/f6bc8611e04255c575799bd31effeadee13d0f8f).
[GitHub] [spark] beliefer commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.
beliefer commented on a change in pull request #28194: URL: https://github.com/apache/spark/pull/28194#discussion_r418417068 ## File path: sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala ## @@ -0,0 +1,187 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import java.io.File + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.sql.catalyst.util.{fileToString, stringToFile} +import org.apache.spark.sql.test.SharedSparkSession +import org.apache.spark.tags.ExtendedSQLTest + +// scalastyle:off line.size.limit +/** + * End-to-end test cases for SQL schemas of expression examples. + * The golden result file is "spark/sql/core/src/test/resources/sql-functions/sql-expression-schema.md". + * + * To run the entire test suite: + * {{{ + * build/sbt "sql/test-only *ExpressionsSchemaSuite" + * }}} + * + * To re-generate golden files for entire suite, run: + * {{{ + * SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *ExpressionsSchemaSuite" + * }}} + * + * For example: + * {{{ + * ... 
+ * @ExpressionDescription( + * usage = "_FUNC_(str, n) - Returns the string which repeats the given string value n times.", + * examples = """ + * Examples: + * > SELECT _FUNC_('123', 2); + * 123123 + * """, + * since = "1.5.0") + * case class StringRepeat(str: Expression, times: Expression) + * ... + * }}} + * + * The format for golden result files look roughly like: + * {{{ + * ... + * | org.apache.spark.sql.catalyst.expressions.StringRepeat | repeat | SELECT repeat('123', 2) | struct | + * ... + * }}} + */ +// scalastyle:on line.size.limit +@ExtendedSQLTest +class ExpressionsSchemaSuite extends QueryTest with SharedSparkSession { + + private val regenerateGoldenFiles: Boolean = System.getenv("SPARK_GENERATE_GOLDEN_FILES") == "1" + + private val baseResourcePath = { +// We use a path based on Spark home for 2 reasons: +// 1. Maven can't get correct resource directory when resources in other jars. +// 2. We test subclasses in the hive-thriftserver module. +val sparkHome = { + assert(sys.props.contains("spark.test.home") || +sys.env.contains("SPARK_HOME"), "spark.test.home or SPARK_HOME is not set.") + sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME")) +} + +java.nio.file.Paths.get(sparkHome, + "sql", "core", "src", "test", "resources", "sql-functions").toFile + } + + private val resultFile = new File(baseResourcePath, "sql-expression-schema.md") + + /** A single SQL query's SQL and schema. 
 */ + protected case class QueryOutput( + className: String, + funcName: String, + sql: String = "N/A", + schema: String = "N/A") { +override def toString: String = { + s"| $className | $funcName | $sql | $schema |" +} + } + + test("Check schemas for expression examples") { +val exampleRe = """^(.+);\n(?s)(.+)$""".r +val funInfos = spark.sessionState.functionRegistry.listFunction().map { funcId => + spark.sessionState.catalog.lookupFunctionInfo(funcId) +} + +val classFunsMap = funInfos.groupBy(_.getClassName).toSeq.sortBy(_._1) +val outputBuffer = new ArrayBuffer[String] +val outputs = new ArrayBuffer[QueryOutput] +val missingExamples = new ArrayBuffer[String] + +classFunsMap.foreach { kv => + val className = kv._1 + kv._2.foreach { funInfo => +val example = funInfo.getExamples +val funcName = funInfo.getName.replaceAll("\\|", "") +if (example == "") { + val queryOutput = QueryOutput(className, funcName) + outputBuffer += queryOutput.toString + outputs += queryOutput + missingExamples += funcName +} + +// If an expression has an 'Examples' segment, the first element is 'Examples'. Because +// this test case is only used to print aliases of expressions for double checking, +// we only need to output the first SQL and its corresponding schema. +// Note: We need to filter out the commands that set the parameters, such as: +// SET
[GitHub] [spark] viirya commented on a change in pull request #28351: [SPARK-31500][SQL] collect_set() of BinaryType returns duplicate elements
viirya commented on a change in pull request #28351: URL: https://github.com/apache/spark/pull/28351#discussion_r418421293 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala ## @@ -139,6 +148,32 @@ case class CollectSet( def this(child: Expression) = this(child, 0, 0) + /* + * SPARK-31500 + * Array[Byte] (BinaryType) Scala equality doesn't work as expected, + * so HashSet returns duplicates; we have to change types to drop + * these duplicates and make collect_set work as expected for this + * data type + */ + override lazy val bufferElementType = child.dataType match { +case BinaryType => ArrayType(BinaryType) Review comment: ArrayType(ByteType)?
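The underlying Scala behavior referenced in SPARK-31500 can be reproduced in isolation, without Spark: JVM arrays compare by reference, so a `HashSet[Array[Byte]]` keeps structurally equal duplicates, while elements with value equality (e.g. an immutable `Seq[Byte]`) deduplicate correctly. A minimal sketch:

```scala
import scala.collection.mutable

object ArrayEqualitySketch {
  def main(args: Array[String]): Unit = {
    val a = Array[Byte](1, 2, 3)
    val b = Array[Byte](1, 2, 3)

    // Arrays compare by reference on the JVM, not by contents.
    println(a == b)             // false
    println(a.sameElements(b))  // true

    // A HashSet of Array[Byte] therefore keeps structural duplicates...
    val rawSet = mutable.HashSet[Array[Byte]](a, b)
    println(rawSet.size)        // 2

    // ...while a set of Seq[Byte] (value equality) deduplicates,
    // which is the motivation for changing the buffer element type.
    val seqSet = mutable.HashSet[Seq[Byte]](a.toSeq, b.toSeq)
    println(seqSet.size)        // 1
  }
}
```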
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28164: [SPARK-31393][SQL] Show the correct alias in a more elegant way that override the prettyName
AmplabJenkins removed a comment on pull request #28164: URL: https://github.com/apache/spark/pull/28164#issuecomment-622253287
[GitHub] [spark] AmplabJenkins commented on pull request #28164: [SPARK-31393][SQL] Show the correct alias in a more elegant way that override the prettyName
AmplabJenkins commented on pull request #28164: URL: https://github.com/apache/spark/pull/28164#issuecomment-622253287
[GitHub] [spark] maropu removed a comment on pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
maropu removed a comment on pull request #28420: URL: https://github.com/apache/spark/pull/28420#issuecomment-622252926 I feel having the pretty functionality in `RuntimeReplaceable` is a bit intrusive. Dropping the idea above is okay, but I think we need a more general solution for getting pretty strings outside `RuntimeReplaceable`. WDYT? @cloud-fan
[GitHub] [spark] maropu commented on a change in pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
maropu commented on a change in pull request #28420: URL: https://github.com/apache/spark/pull/28420#discussion_r418348605 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ## @@ -323,6 +323,20 @@ trait RuntimeReplaceable extends UnaryExpression with Unevaluable { // two `RuntimeReplaceable` are considered to be semantically equal if their "child" expressions // are semantically equal. override lazy val canonicalized: Expression = child.canonicalized + + override def innerChildren: Seq[Expression] = sys.error("RuntimeReplaceable must implement" + +" innerChildren with the original parameters") + + protected val sqlStrSeparator: String = ", " + + override def sql: String = RuntimeReplaceable.this.prettyName + +prettyChildren.map(_.sql).mkString("(", sqlStrSeparator, ")") + + protected var prettyChildren: Seq[Expression] = innerChildren Review comment: I think it's better not to use `var` where possible, so how about something like this? ``` // RuntimeReplaceable def newInstance(innerChildren: Seq[Expression]): RuntimeReplaceable // e.g., ParseToTimestamp override def newInstance(innerChildren: Seq[Expression]): ParseToTimestamp = { innerChildren match { case Seq(left) => ParseToTimestamp(left, None, child) case Seq(left, format) => ParseToTimestamp(left, Some(format), child) } } ``` Then, update the `toPrettySQL` in `utils.package`: ``` def toPrettySQL(e: Expression): String = e match { case r: RuntimeReplaceable => r.newInstance(r.innerChildren.map(usePrettyExpression)).sql case _ => usePrettyExpression(e).sql } ```
[GitHub] [spark] SparkQA commented on pull request #28164: [SPARK-31393][SQL] Show the correct alias in a more elegant way that override the prettyName
SparkQA commented on pull request #28164: URL: https://github.com/apache/spark/pull/28164#issuecomment-622252974 **[Test build #122159 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122159/testReport)** for PR 28164 at commit [`f048601`](https://github.com/apache/spark/commit/f048601421d12ada878880c6a71f43b1bff3adf3).
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
HyukjinKwon commented on a change in pull request #28427: URL: https://github.com/apache/spark/pull/28427#discussion_r418420692 ## File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md ## @@ -206,9 +204,6 @@ | org.apache.spark.sql.catalyst.expressions.Pow | power | SELECT power(2, 3) | struct | | org.apache.spark.sql.catalyst.expressions.Quarter | quarter | SELECT quarter('2016-08-31') | struct | | org.apache.spark.sql.catalyst.expressions.RLike | rlike | SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*' | struct<%SystemDrive%UsersJohn RLIKE %SystemDrive%\Users.*:boolean> | -| org.apache.spark.sql.catalyst.expressions.Rand | random | SELECT random() | struct | -| org.apache.spark.sql.catalyst.expressions.Rand | rand | SELECT rand() | struct | -| org.apache.spark.sql.catalyst.expressions.Randn | randn | SELECT randn() | struct | Review comment: I think @beliefer meant https://github.com/apache/spark/pull/28392 is only landed into master and we should keep these just as they are at https://github.com/apache/spark/pull/28194. I assume `rand(seed)` was printed out because this test was run against the latest `master` instead of `branch-3.0` at https://github.com/apache/spark/pull/28427/commits/948a54e7059b29a1a001315b0ed2d69ec4104314 (?).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
AmplabJenkins removed a comment on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622252767 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/122155/ Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
SparkQA removed a comment on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622231819 **[Test build #122155 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122155/testReport)** for PR 28427 at commit [`948a54e`](https://github.com/apache/spark/commit/948a54e7059b29a1a001315b0ed2d69ec4104314).
[GitHub] [spark] AmplabJenkins commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
AmplabJenkins commented on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622252763
[GitHub] [spark] maropu commented on pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
maropu commented on pull request #28420: URL: https://github.com/apache/spark/pull/28420#issuecomment-622252926 I feel having the pretty functionality in `RuntimeReplaceable` is a bit intrusive. Dropping the idea above is okay, but I think we need a more general solution for getting pretty strings outside `RuntimeReplaceable`. WDYT? @cloud-fan
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
AmplabJenkins removed a comment on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622252763 Merged build finished. Test FAILed.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
HyukjinKwon commented on a change in pull request #28427: URL: https://github.com/apache/spark/pull/28427#discussion_r418420692 ## File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md ## @@ -206,9 +204,6 @@ | org.apache.spark.sql.catalyst.expressions.Pow | power | SELECT power(2, 3) | struct | | org.apache.spark.sql.catalyst.expressions.Quarter | quarter | SELECT quarter('2016-08-31') | struct | | org.apache.spark.sql.catalyst.expressions.RLike | rlike | SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*' | struct<%SystemDrive%UsersJohn RLIKE %SystemDrive%\Users.*:boolean> | -| org.apache.spark.sql.catalyst.expressions.Rand | random | SELECT random() | struct | -| org.apache.spark.sql.catalyst.expressions.Rand | rand | SELECT rand() | struct | -| org.apache.spark.sql.catalyst.expressions.Randn | randn | SELECT randn() | struct | Review comment: I think @beliefer meant https://github.com/apache/spark/pull/28392 is only landed into master and we should keep these just as they are at https://github.com/apache/spark/pull/28194. I assume this seed was printed out because this test was run against the latest `master` instead of `branch-3.0` at https://github.com/apache/spark/pull/28427/commits/948a54e7059b29a1a001315b0ed2d69ec4104314
[GitHub] [spark] maropu commented on a change in pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
maropu commented on a change in pull request #28420: URL: https://github.com/apache/spark/pull/28420#discussion_r418420802 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ## @@ -323,6 +323,20 @@ trait RuntimeReplaceable extends UnaryExpression with Unevaluable { // two `RuntimeReplaceable` are considered to be semantically equal if their "child" expressions // are semantically equal. override lazy val canonicalized: Expression = child.canonicalized + + override def innerChildren: Seq[Expression] = sys.error("RuntimeReplaceable must implement" + +" innerChildren with the original parameters") + + protected val sqlStrSeparator: String = ", " + + override def sql: String = RuntimeReplaceable.this.prettyName + +prettyChildren.map(_.sql).mkString("(", sqlStrSeparator, ")") + + protected var prettyChildren: Seq[Expression] = innerChildren Review comment: I feel having the pretty functionality in `RuntimeReplaceable` is a bit intrusive. Dropping the idea above is okay, but I think we need a more general solution for getting pretty strings outside `RuntimeReplaceable`. WDYT? @cloud-fan
[GitHub] [spark] SparkQA commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
SparkQA commented on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622252679 **[Test build #122155 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122155/testReport)** for PR 28427 at commit [`948a54e`](https://github.com/apache/spark/commit/948a54e7059b29a1a001315b0ed2d69ec4104314). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
AmplabJenkins removed a comment on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622251633
[GitHub] [spark] AmplabJenkins commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
AmplabJenkins commented on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622251633
[GitHub] [spark] SparkQA commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
SparkQA commented on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622251397 **[Test build #122158 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122158/testReport)** for PR 28427 at commit [`db16f76`](https://github.com/apache/spark/commit/db16f762318197eb82ed962bc89daf3d4749ac80).
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.
HyukjinKwon commented on a change in pull request #28194: URL: https://github.com/apache/spark/pull/28194#discussion_r418419246 ## File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md ## @@ -0,0 +1,341 @@ + Review comment: Hopefully we can improve the diff here ...
[GitHub] [spark] viirya commented on a change in pull request #28351: [SPARK-31500][SQL] collect_set() of BinaryType returns duplicate elements
viirya commented on a change in pull request #28351: URL: https://github.com/apache/spark/pull/28351#discussion_r418418880 ## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ## @@ -530,6 +530,26 @@ class DataFrameAggregateSuite extends QueryTest ) } + test("SPARK-31500: collect_set() of BinaryType returns duplicate elements") { +val bytesTest1 = "test1".getBytes +val bytesTest2 = "test2".getBytes +val df = Seq(bytesTest1, bytesTest1, bytesTest2).toDF("a") +val ret = df.select(collect_set($"a")).collect() + .map(r => r.getAs[Seq[_]](0)).head +assert(ret.length == 2) + +val a = "aa".getBytes +val b = "bb".getBytes +val c = "cc".getBytes +val d = "dd".getBytes +val df1 = Seq((a, b), (a, b), (c, d)) + .toDF("x", "y") + .select(struct($"x", $"y").as("a")) +val ret1 = df1.select(collect_set($"a")).collect() + .map(r => r.getAs[Seq[_]](0)).head +assert(ret1.length == 2) Review comment: +1
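The fix under review hinges on comparing binary values by content rather than by reference (Java's `byte[]` hashes by identity, so a naive set keeps duplicates). A minimal, Spark-free Python sketch of that idea, with a hypothetical helper name:

```python
def collect_set_by_content(values):
    """Dedupe byte sequences by content, preserving first-seen order.

    Mirrors the SPARK-31500 fix conceptually: keys must be immutable
    and content-hashable, unlike a raw Java byte[] reference.
    """
    seen = set()
    out = []
    for v in values:
        key = bytes(v)  # immutable copy, hashed by content
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out
```

Running it on the same inputs as the quoted test (`test1` twice plus `test2`) yields two distinct elements, which is what `collect_set()` is expected to return.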
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.
HyukjinKwon commented on a change in pull request #28194: URL: https://github.com/apache/spark/pull/28194#discussion_r418418796 ## File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md ## @@ -0,0 +1,341 @@ + Review comment: Okay, but it seems more difficult to track the diff. See https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.0-test-sbt-hadoop-2.7-hive-2.3/431/testReport/junit/org.apache.spark.sql/ExpressionsSchemaSuite/Check_schemas_for_expression_examples/ ![Screen Shot 2020-05-01 at 2 06 10 PM](https://user-images.githubusercontent.com/6477701/80783491-f5a7a400-8bb4-11ea-9e91-880719bd472c.png) ![Screen Shot 2020-05-01 at 2 06 04 PM](https://user-images.githubusercontent.com/6477701/80783494-f7716780-8bb4-11ea-9425-dfc348644649.png) ![Screen Shot 2020-05-01 at 2 05 58 PM](https://user-images.githubusercontent.com/6477701/80783495-f8a29480-8bb4-11ea-8b7a-eb16cbcc9a8e.png) ![Screen Shot 2020-05-01 at 2 05 48 PM](https://user-images.githubusercontent.com/6477701/80783498-f9d3c180-8bb4-11ea-82b4-6f865294a21f.png) ![Screen Shot 2020-05-01 at 2 05 42 PM](https://user-images.githubusercontent.com/6477701/80783501-fa6c5800-8bb4-11ea-9ce6-f562aebf6b6c.png) ![Screen Shot 2020-05-01 at 2 05 35 PM](https://user-images.githubusercontent.com/6477701/80783504-fb9d8500-8bb4-11ea-9a9a-f0e5eadcb596.png)
[GitHub] [spark] yaooqinn commented on a change in pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
yaooqinn commented on a change in pull request #28420: URL: https://github.com/apache/spark/pull/28420#discussion_r418418394 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ## @@ -323,6 +323,20 @@ trait RuntimeReplaceable extends UnaryExpression with Unevaluable { // two `RuntimeReplaceable` are considered to be semantically equal if their "child" expressions // are semantically equal. override lazy val canonicalized: Expression = child.canonicalized + + override def innerChildren: Seq[Expression] = sys.error("RuntimeReplaceable must implement" + +" innerChildren with the original parameters") + + protected val sqlStrSeparator: String = ", " + + override def sql: String = RuntimeReplaceable.this.prettyName + +prettyChildren.map(_.sql).mkString("(", sqlStrSeparator, ")") + + protected var prettyChildren: Seq[Expression] = innerChildren Review comment: The `newInstance` approach doesn't seem robust for making a clone with `Seq[Expression]`. I think we can just add `def prettySQL: String` in RuntimeReplaceable for `utils.package` to call, and override it wherever needed without creating new instances. That seems like a better and simpler idea, which came up based on your suggestions. WDYT?
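The "default pretty form on the base trait, overridable per expression, no instance cloning" idea discussed here can be sketched outside Spark. This is a hypothetical plain-Python illustration (class and method names mirror the Scala discussion but are assumptions, not Spark's actual API):

```python
# Sketch: the base "runtime replaceable" type builds a pretty SQL string
# from its original user-facing arguments; subclasses only supply those
# arguments (and may override pretty_sql entirely) -- no cloning needed.
class Expression:
    def sql(self) -> str:
        raise NotImplementedError

class Literal(Expression):
    def __init__(self, value):
        self.value = value
    def sql(self) -> str:
        return str(self.value)

class RuntimeReplaceable(Expression):
    pretty_name = "expr"  # analogous to Spark's prettyName

    def inner_children(self):
        # the original parameters as written by the user
        return []

    def pretty_sql(self) -> str:
        args = ", ".join(c.sql() for c in self.inner_children())
        return f"{self.pretty_name}({args})"

class Nvl(RuntimeReplaceable):
    pretty_name = "nvl"
    def __init__(self, left, right):
        self.left, self.right = left, right
    def inner_children(self):
        return [self.left, self.right]
```

With this shape, `Nvl(Literal(1), Literal(0)).pretty_sql()` renders the user-facing form `nvl(1, 0)` rather than the replaced implementation expression.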
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable
HyukjinKwon commented on a change in pull request #28383: URL: https://github.com/apache/spark/pull/28383#discussion_r418417828 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuerySuite.scala ## @@ -150,4 +150,30 @@ class OptimizeMetadataOnlyQuerySuite extends QueryTest with SharedSparkSession { } } } + + test("SPARK-31590 The filter used by Metadata-only queries should not have Unevaluable") { +withTable("test_tbl") { + withSQLConf(OPTIMIZER_METADATA_ONLY.key -> "true") { +sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET PARTITIONED BY (d ,h)") Review comment: Can we reuse `testMetadataOnly`?
[GitHub] [spark] beliefer commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.
beliefer commented on a change in pull request #28194: URL: https://github.com/apache/spark/pull/28194#discussion_r418417068 ## File path: sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala ## @@ -0,0 +1,187 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import java.io.File + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.sql.catalyst.util.{fileToString, stringToFile} +import org.apache.spark.sql.test.SharedSparkSession +import org.apache.spark.tags.ExtendedSQLTest + +// scalastyle:off line.size.limit +/** + * End-to-end test cases for SQL schemas of expression examples. + * The golden result file is "spark/sql/core/src/test/resources/sql-functions/sql-expression-schema.md". + * + * To run the entire test suite: + * {{{ + * build/sbt "sql/test-only *ExpressionsSchemaSuite" + * }}} + * + * To re-generate golden files for entire suite, run: + * {{{ + * SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *ExpressionsSchemaSuite" + * }}} + * + * For example: + * {{{ + * ...
+ * @ExpressionDescription( + * usage = "_FUNC_(str, n) - Returns the string which repeats the given string value n times.", + * examples = """ + * Examples: + * > SELECT _FUNC_('123', 2); + * 123123 + * """, + * since = "1.5.0") + * case class StringRepeat(str: Expression, times: Expression) + * ... + * }}} + * + * The format for golden result files look roughly like: + * {{{ + * ... + * | org.apache.spark.sql.catalyst.expressions.StringRepeat | repeat | SELECT repeat('123', 2) | struct | + * ... + * }}} + */ +// scalastyle:on line.size.limit +@ExtendedSQLTest +class ExpressionsSchemaSuite extends QueryTest with SharedSparkSession { + + private val regenerateGoldenFiles: Boolean = System.getenv("SPARK_GENERATE_GOLDEN_FILES") == "1" + + private val baseResourcePath = { +// We use a path based on Spark home for 2 reasons: +// 1. Maven can't get correct resource directory when resources in other jars. +// 2. We test subclasses in the hive-thriftserver module. +val sparkHome = { + assert(sys.props.contains("spark.test.home") || +sys.env.contains("SPARK_HOME"), "spark.test.home or SPARK_HOME is not set.") + sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME")) +} + +java.nio.file.Paths.get(sparkHome, + "sql", "core", "src", "test", "resources", "sql-functions").toFile + } + + private val resultFile = new File(baseResourcePath, "sql-expression-schema.md") + + /** A single SQL query's SQL and schema.
*/ + protected case class QueryOutput( + className: String, + funcName: String, + sql: String = "N/A", + schema: String = "N/A") { +override def toString: String = { + s"| $className | $funcName | $sql | $schema |" +} + } + + test("Check schemas for expression examples") { +val exampleRe = """^(.+);\n(?s)(.+)$""".r +val funInfos = spark.sessionState.functionRegistry.listFunction().map { funcId => + spark.sessionState.catalog.lookupFunctionInfo(funcId) +} + +val classFunsMap = funInfos.groupBy(_.getClassName).toSeq.sortBy(_._1) +val outputBuffer = new ArrayBuffer[String] +val outputs = new ArrayBuffer[QueryOutput] +val missingExamples = new ArrayBuffer[String] + +classFunsMap.foreach { kv => + val className = kv._1 + kv._2.foreach { funInfo => +val example = funInfo.getExamples +val funcName = funInfo.getName.replaceAll("\\|", "") +if (example == "") { + val queryOutput = QueryOutput(className, funcName) + outputBuffer += queryOutput.toString + outputs += queryOutput + missingExamples += funcName +} + +// If expression exists 'Examples' segment, the first element is 'Examples'. Because +// this test case is only used to print aliases of expressions for double checking. +// Therefore, we only need to output the first SQL and its corresponding schema. +// Note: We need to filter out the commands that set the parameters, such as: +// SET
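The golden-file workflow in the quoted `ExpressionsSchemaSuite` (regenerate when `SPARK_GENERATE_GOLDEN_FILES=1`, otherwise compare) can be sketched generically. The helper name and file path below are assumptions for illustration, not the suite's actual helpers:

```python
import os
from pathlib import Path

def check_golden(actual: str, golden_path: Path) -> None:
    """Regenerate the golden file when SPARK_GENERATE_GOLDEN_FILES=1;
    otherwise compare freshly generated output against the stored file."""
    if os.environ.get("SPARK_GENERATE_GOLDEN_FILES") == "1":
        golden_path.write_text(actual)
    else:
        expected = golden_path.read_text()
        assert actual == expected, "schema output differs from golden file"
```

This is also why the seed-related diffs discussed above appear: whichever branch regenerated `sql-expression-schema.md` last determines its contents, so the file must be regenerated against the branch being tested.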
[GitHub] [spark] maropu commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
maropu commented on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622248576 Ah, yes. We need to ignore them in branch-3.0. I'll update.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable
HyukjinKwon commented on a change in pull request #28383: URL: https://github.com/apache/spark/pull/28383#discussion_r418415403 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuerySuite.scala ## @@ -150,4 +150,30 @@ class OptimizeMetadataOnlyQuerySuite extends QueryTest with SharedSparkSession { } } } + + test("SPARK-31590 The filter used by Metadata-only queries should not have Unevaluable") { +withTable("test_tbl") { + withSQLConf(OPTIMIZER_METADATA_ONLY.key -> "true") { +sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET PARTITIONED BY (d ,h)") Review comment: Can you minimise the test case and make it consistent with the style used in this file? I think you can create the partitioned table via the `write.partitionBy(...).parquet(...)` API instead of relying on the SQL syntax here even when you create tables.
[GitHub] [spark] HyukjinKwon commented on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable
HyukjinKwon commented on pull request #28383: URL: https://github.com/apache/spark/pull/28383#issuecomment-622248593 Looks fine to me.
[GitHub] [spark] beliefer commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
beliefer commented on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622248015 @maropu, it seems we did not change the schema of rand() in branch-3.0? Why do we preserve the difference between master and branch-3.0?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable
HyukjinKwon commented on a change in pull request #28383: URL: https://github.com/apache/spark/pull/28383#discussion_r418415403 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuerySuite.scala ## @@ -150,4 +150,30 @@ class OptimizeMetadataOnlyQuerySuite extends QueryTest with SharedSparkSession { } } } + + test("SPARK-31590 The filter used by Metadata-only queries should not have Unevaluable") { +withTable("test_tbl") { + withSQLConf(OPTIMIZER_METADATA_ONLY.key -> "true") { +sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET PARTITIONED BY (d ,h)") Review comment: Can you minimise the test case and make it consistent with the style used in this file? I think you can create the partitioned table via the `write.partitionBy(...).parquet(...)` API instead of relying on the SQL syntax here.
[GitHub] [spark] cloud-fan commented on a change in pull request #26624: [SPARK-8981][core] Add MDC support in Executor
cloud-fan commented on a change in pull request #26624: URL: https://github.com/apache/spark/pull/26624#discussion_r418414716 ## File path: core/src/main/scala/org/apache/spark/executor/Executor.scala ## @@ -674,6 +677,18 @@ private[spark] class Executor( } } + private def setMDCForTask(taskDescription: TaskDescription): Unit = { +val properties = taskDescription.properties + +org.slf4j.MDC.put("appId", properties.getProperty("spark.app.id")) Review comment: And where do we set app id/name to `task.localProperties`? Do users need to do it manually?
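For readers unfamiliar with MDC: the diff stamps per-task properties into slf4j's Mapped Diagnostic Context so the log layout can print them with every line. A rough Python analog using a `logging.Filter` (logger and field names here are illustrative assumptions, not Spark's):

```python
import io
import logging

class TaskContextFilter(logging.Filter):
    """Analog of slf4j MDC: attach per-task properties to every
    log record so the formatter can reference them by name."""
    def __init__(self, app_id: str, task_id: str):
        super().__init__()
        self.app_id = app_id
        self.task_id = task_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.app_id = self.app_id
        record.task_id = self.task_id
        return True  # never drop records, only annotate them

def make_task_logger(app_id: str, task_id: str, stream) -> logging.Logger:
    logger = logging.getLogger(f"task-{task_id}")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(app_id)s %(task_id)s %(message)s"))
    logger.addHandler(handler)
    logger.addFilter(TaskContextFilter(app_id, task_id))
    logger.setLevel(logging.INFO)
    return logger
```

cloud-fan's question maps to the `app_id` argument above: something has to put the value into the task's properties before the filter (or MDC) can read it.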
[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
maropu commented on a change in pull request #28425: URL: https://github.com/apache/spark/pull/28425#discussion_r418414748 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala ## @@ -46,6 +47,35 @@ trait DataSourceV2ScanExecBase extends LeafExecNode { Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, result) } + /** + * Shorthand for calling redactString() without specifying redacting rules + */ + protected def redact(text: String): String = { +Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, text) + } + + + override def verboseStringWithOperatorId(): String = { +val metaDataStr = if (scan.isInstanceOf[SupportsMetadata]) { + val scanWithExplainSupport: SupportsMetadata = scan.asInstanceOf[SupportsMetadata] + val metadataStr = scanWithExplainSupport.getMetaData.toSeq.sorted.filterNot { +case (_, value) if (value.isEmpty || value.equals("[]")) => true +case (_, _) => false + }.map { +case (key, value) => s"$key: ${redact(value)}" + } Review comment: nit: format ``` val metaDataStr = scan match { case s: SupportsMetadata => s.getMetaData().toSeq.sorted.flatMap { case (_, value) if value.isEmpty || value == "[]" => None case (key, value) => Some(s"$key: ${redact(value)}") } case _ => Seq(scan.description()) } ```
[GitHub] [spark] yaooqinn commented on a change in pull request #28420: [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
yaooqinn commented on a change in pull request #28420: URL: https://github.com/apache/spark/pull/28420#discussion_r418414564 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ## @@ -323,6 +323,20 @@ trait RuntimeReplaceable extends UnaryExpression with Unevaluable { // two `RuntimeReplaceable` are considered to be semantically equal if their "child" expressions // are semantically equal. override lazy val canonicalized: Expression = child.canonicalized + + override def innerChildren: Seq[Expression] = sys.error("RuntimeReplaceable must implement" + Review comment: Good Point!
[GitHub] [spark] cloud-fan commented on a change in pull request #28406: [SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
cloud-fan commented on a change in pull request #28406: URL: https://github.com/apache/spark/pull/28406#discussion_r418414225 ## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java ## @@ -342,6 +380,43 @@ public void readLongs( } } + // A fork of `readLongs`, which rebases the timestamp long value (microseconds) before filling + // the Spark column vector. + public void readLongsWithRebase( + int total, + WritableColumnVector c, + int rowId, + int level, + VectorizedValuesReader data) throws IOException { +int left = total; +while (left > 0) { + if (this.currentCount == 0) this.readNextGroup(); + int n = Math.min(left, this.currentCount); + switch (mode) { +case RLE: + if (currentValue == level) { +data.readLongsWithRebase(n, c, rowId); + } else { +c.putNulls(rowId, n); + } + break; +case PACKED: Review comment: The general idea is to add an extra loop to check if we need to rebase or not, and it's only worthwhile if the no-rebase code path is much faster than the rebase code path.
[GitHub] [spark] cloud-fan commented on a change in pull request #28406: [SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
cloud-fan commented on a change in pull request #28406: URL: https://github.com/apache/spark/pull/28406#discussion_r418414012 ## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java ## @@ -342,6 +380,43 @@ public void readLongs( } } + // A fork of `readLongs`, which rebases the timestamp long value (microseconds) before filling + // the Spark column vector. + public void readLongsWithRebase( + int total, + WritableColumnVector c, + int rowId, + int level, + VectorizedValuesReader data) throws IOException { +int left = total; +while (left > 0) { + if (this.currentCount == 0) this.readNextGroup(); + int n = Math.min(left, this.currentCount); + switch (mode) { +case RLE: + if (currentValue == level) { +data.readLongsWithRebase(n, c, rowId); + } else { +c.putNulls(rowId, n); + } + break; +case PACKED: Review comment: I didn't optimize this case because the no-rebase code path doesn't look very fast: it has an `if-else` in the loop.
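The "extra loop to check if we need to rebase" idea can be sketched outside the Parquet reader. The cutover constant and the rebase helper below are simplifying assumptions for illustration (the real Julian-to-Gregorian rebase is calendar arithmetic, not a fixed shift):

```python
# Assumed cutover: 1582-10-15 UTC in microseconds since the Unix epoch.
# Values at/after this point need no rebase.
JULIAN_GREGORIAN_SWITCH_MICROS = -12_219_292_800 * 1_000_000

def rebase_julian_to_gregorian(micros: int) -> int:
    # Stand-in for the real per-value rebase: shift by the 10 days the
    # Julian calendar lagged at the 1582 cutover.
    return micros + 10 * 24 * 3600 * 1_000_000

def read_longs_with_rebase(run: list[int]) -> list[int]:
    # Extra up-front scan: if nothing predates the cutover, take the
    # cheap copy path; only otherwise pay for per-value rebasing.
    if all(v >= JULIAN_GREGORIAN_SWITCH_MICROS for v in run):
        return list(run)
    return [v if v >= JULIAN_GREGORIAN_SWITCH_MICROS
            else rebase_julian_to_gregorian(v) for v in run]
```

As cloud-fan notes, the check only pays off when the no-rebase path (here, a plain copy) is much cheaper than the branchy per-value loop, which is why the PACKED case was left unoptimized.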
[GitHub] [spark] maropu commented on pull request #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
maropu commented on pull request #27233: URL: https://github.com/apache/spark/pull/27233#issuecomment-622244872 Could we close the JIRA, too?
[GitHub] [spark] cloud-fan commented on pull request #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
cloud-fan commented on pull request #27233: URL: https://github.com/apache/spark/pull/27233#issuecomment-622244255 To put a conclusion: I think this PR does fix a "correctness" issue according to the SQL standard. But as @tgravescs said in https://github.com/apache/spark/pull/27233#issuecomment-578187736 , the current behavior looks reasonable as well, and is the same with Oracle/SQL Server. This is a very corner case, and most likely people don't care.
[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
maropu commented on a change in pull request #28425: URL: https://github.com/apache/spark/pull/28425#discussion_r418413096 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/connector/SupportsMetadata.scala ## @@ -0,0 +1,21 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.internal.connector + +trait SupportsMetadata { Review comment: Please add a comment about what this class is used for.
[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
maropu commented on a change in pull request #28425: URL: https://github.com/apache/spark/pull/28425#discussion_r418409102 ## File path: external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScan.scala ## @@ -65,4 +65,8 @@ case class AvroScan( } override def hashCode(): Int = super.hashCode() + + override def getMetaData(): Map[String, String] = { +super.metaData ++ Map("Format" -> "avro") Review comment: Could we move all the `Format` metadata into the `FileScan.metadata` side? ``` "Format" -> s"${this.getClass.getSimpleName.replace("Scan", "").toLowerCase(Locale.ROOT)}" ```
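The suggested trait-side derivation can be sketched in plain Scala; `AvroScan` and `ParquetScan` below are stand-in classes for illustration, not Spark's actual case classes:

```scala
import java.util.Locale

// Stand-ins for the concrete DSV2 Scan implementations.
class AvroScan
class ParquetScan

object FormatName {
  // The derivation proposed for the FileScan side: strip the "Scan" suffix
  // from the runtime class name and lower-case the remainder.
  def formatName(scan: AnyRef): String =
    scan.getClass.getSimpleName.replace("Scan", "").toLowerCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    println(formatName(new AvroScan))    // avro
    println(formatName(new ParquetScan)) // parquet
  }
}
```

This keeps every concrete scan from repeating an identical `"Format" -> ...` entry, at the cost of tying the label to the class-naming convention.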
[GitHub] [spark] SparkQA removed a comment on pull request #28348: [MINOR][SQL][DOCS] Remove two leading spaces from sql tables
SparkQA removed a comment on pull request #28348: URL: https://github.com/apache/spark/pull/28348#issuecomment-622240171 **[Test build #122156 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122156/testReport)** for PR 28348 at commit [`f357d36`](https://github.com/apache/spark/commit/f357d36e9fb82b4dce6719b103138e53704fd375).
[GitHub] [spark] SparkQA commented on pull request #28348: [MINOR][SQL][DOCS] Remove two leading spaces from sql tables
SparkQA commented on pull request #28348: URL: https://github.com/apache/spark/pull/28348#issuecomment-622243201 **[Test build #122156 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122156/testReport)** for PR 28348 at commit [`f357d36`](https://github.com/apache/spark/commit/f357d36e9fb82b4dce6719b103138e53704fd375). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable
SparkQA commented on pull request #28383: URL: https://github.com/apache/spark/pull/28383#issuecomment-622243294 **[Test build #122157 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122157/testReport)** for PR 28383 at commit [`86f28d5`](https://github.com/apache/spark/commit/86f28d51f343becc6a516f726d20efd4d12c7708).
[GitHub] [spark] cloud-fan commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.
cloud-fan commented on a change in pull request #28194: URL: https://github.com/apache/spark/pull/28194#discussion_r418412131 ## File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md ## @@ -0,0 +1,341 @@ + Review comment: See https://github.com/apache/spark/pull/28194#issuecomment-613854769. I guess it's easier for humans to read the table.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable
AmplabJenkins removed a comment on pull request #28383: URL: https://github.com/apache/spark/pull/28383#issuecomment-620372971 Can one of the admins verify this patch?
[GitHub] [spark] HyukjinKwon commented on pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable
HyukjinKwon commented on pull request #28383: URL: https://github.com/apache/spark/pull/28383#issuecomment-622241833 ok to test
[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
maropu commented on a change in pull request #28425: URL: https://github.com/apache/spark/pull/28425#discussion_r418410438 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala ## @@ -29,11 +29,13 @@ import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjectio import org.apache.spark.sql.connector.read.{Batch, InputPartition, Scan, Statistics, SupportsReportStatistics} import org.apache.spark.sql.execution.PartitionedFileUtil import org.apache.spark.sql.execution.datasources._ +import org.apache.spark.sql.internal.connector.SupportsMetadata import org.apache.spark.sql.sources.Filter import org.apache.spark.sql.types.StructType import org.apache.spark.util.Utils -trait FileScan extends Scan with Batch with SupportsReportStatistics with Logging { +trait FileScan extends Scan + with Batch with SupportsReportStatistics with Logging with SupportsMetadata { Review comment: super nit: better to put `Logging` at the end? `with Batch with SupportsReportStatistics with SupportsMetadata with Logging {`. I personally think we'd better group them by similar features.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28365: [SPARK-31571][R] Overhaul stop/message/warning calls to be more canonical
HyukjinKwon commented on a change in pull request #28365: URL: https://github.com/apache/spark/pull/28365#discussion_r418410881 ## File path: R/pkg/R/utils.R ## @@ -354,8 +354,8 @@ varargsToStrEnv <- function(...) { } else { value <- pairs[[name]] if (!(is.logical(value) || is.numeric(value) || is.character(value) || is.null(value))) { - stop(paste0("Unsupported type for ", name, " : ", class(value), - ". Supported types are logical, numeric, character and NULL."), call. = FALSE) + stop("Unsupported type for ", name, " : ", toString(class(value)), ". ", Review comment: Quick question, why do we need to do `toString`? Were there any differences between `stop(paste(..., ...))` and `stop(..., ...)`?
[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
maropu commented on a change in pull request #28425: URL: https://github.com/apache/spark/pull/28425#discussion_r418410243 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala ## @@ -46,6 +47,35 @@ trait DataSourceV2ScanExecBase extends LeafExecNode { Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, result) } + /** + * Shorthand for calling redactString() without specifying redacting rules + */ + protected def redact(text: String): String = { +Utils.redact(sqlContext.sessionState.conf.stringRedactionPattern, text) + } + + Review comment: super nit: remove the single blank line.
[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
maropu commented on a change in pull request #28425: URL: https://github.com/apache/spark/pull/28425#discussion_r418410056 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/connector/SupportsMetadata.scala ## @@ -0,0 +1,21 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.internal.connector + +trait SupportsMetadata { + def getMetaData(): Map[String, String] +} Review comment: Don't we need to move this file to the Java side along with `Batch` and `SupportsReportStatistics`?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28365: [SPARK-31571][R] Overhaul stop/message/warning calls to be more canonical
HyukjinKwon commented on a change in pull request #28365: URL: https://github.com/apache/spark/pull/28365#discussion_r418409626 ## File path: R/pkg/R/DataFrame.R ## @@ -2604,18 +2603,17 @@ setMethod("join", if (is.null(joinType)) { sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc) } else { -if (joinType %in% c("inner", "cross", +valid_join_types <- c("inner", "cross", Review comment: nit: let's keep the naming rule here -> `validJoinTypes`
[GitHub] [spark] SparkQA commented on pull request #28348: [MINOR][SQL][DOCS] Remove two leading spaces from sql tables
SparkQA commented on pull request #28348: URL: https://github.com/apache/spark/pull/28348#issuecomment-622240171 **[Test build #122156 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122156/testReport)** for PR 28348 at commit [`f357d36`](https://github.com/apache/spark/commit/f357d36e9fb82b4dce6719b103138e53704fd375).
[GitHub] [spark] huaxingao commented on pull request #28348: [MINOR][SQL][DOCS] Remove two leading spaces from sql tables
huaxingao commented on pull request #28348: URL: https://github.com/apache/spark/pull/28348#issuecomment-622239789 If there are no further comments, could one of you please merge this PR for me? Thanks a lot!
[GitHub] [spark] maropu commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
maropu commented on a change in pull request #28425: URL: https://github.com/apache/spark/pull/28425#discussion_r418409233 ## File path: external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScan.scala ## @@ -65,4 +65,8 @@ case class AvroScan( } override def hashCode(): Int = super.hashCode() + + override def getMetaData(): Map[String, String] = { +super.metaData ++ Map("Format" -> "avro") Review comment: Also, could we check the explain output for the Avro V2 scan?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28383: [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable
HyukjinKwon commented on a change in pull request #28383: URL: https://github.com/apache/spark/pull/28383#discussion_r418408328 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala ## @@ -119,6 +119,10 @@ case class OptimizeMetadataOnlyQuery(catalog: SessionCatalog) extends Rule[Logic } } +if (normalizedFilters.exists(_.find(_.isInstanceOf[Unevaluable]).isDefined)) { + return child +} Review comment: Why don't you just filter out subqueries consistently with other normalized filters, by `normalizedFilters.filterNot(SubqueryExpression.hasSubquery)`?
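The suggested shape — dropping just the subquery-bearing predicates instead of abandoning the whole optimization — can be sketched with simplified stand-in types (`Filter` and its `hasSubquery` flag below are illustrative, not Catalyst's expression tree):

```scala
object PruneSubqueryFilters {
  // Minimal stand-in for a Catalyst predicate: it either contains a subquery
  // (unevaluable during metadata-only planning) or it does not.
  final case class Filter(sql: String, hasSubquery: Boolean)

  // Mirrors the reviewer's `normalizedFilters.filterNot(SubqueryExpression.hasSubquery)`:
  // keep the evaluable filters, drop the subquery-bearing ones.
  def prune(normalizedFilters: Seq[Filter]): Seq[Filter] =
    normalizedFilters.filterNot(_.hasSubquery)

  def main(args: Array[String]): Unit = {
    val filters = Seq(
      Filter("isnotnull(id)", hasSubquery = false),
      Filter("id IN (select ...)", hasSubquery = true))
    println(prune(filters).map(_.sql).mkString(", ")) // isnotnull(id)
  }
}
```

The difference from the quoted patch is behavioral: `return child` gives up on the metadata-only rewrite entirely, while filtering keeps the rewrite alive using whatever predicates remain evaluable.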
[GitHub] [spark] beliefer commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
beliefer commented on a change in pull request #28425: URL: https://github.com/apache/spark/pull/28425#discussion_r418405930 ## File path: sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala ## @@ -343,6 +343,74 @@ class ExplainSuite extends ExplainSuiteHelper with DisableAdaptiveExecutionSuite assert(getNormalizedExplain(df1, FormattedMode) === getNormalizedExplain(df2, FormattedMode)) } } + + test("Explain formatted output for scan operator for datasource V2") { +withTempDir { dir => + Seq("parquet", "orc", "csv", "json").foreach { format => +val basePath = dir.getCanonicalPath + "/" + format +val expected_plan_fragment = + s""" + |\\(1\\) BatchScan + |Output \\[2\\]: \\[value#x, id#x\\] + |DataFilters: \\[isnotnull\\(value#x\\), \\(value#x > 2\\)\\] + |Format: $format + |Location: InMemoryFileIndex\\[.*\\] + |PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\] + |PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\] Review comment: Could we extract this line as a variable?
[GitHub] [spark] beliefer commented on a change in pull request #28425: [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
beliefer commented on a change in pull request #28425: URL: https://github.com/apache/spark/pull/28425#discussion_r418405736 ## File path: sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala ## @@ -343,6 +343,74 @@ class ExplainSuite extends ExplainSuiteHelper with DisableAdaptiveExecutionSuite assert(getNormalizedExplain(df1, FormattedMode) === getNormalizedExplain(df2, FormattedMode)) } } + + test("Explain formatted output for scan operator for datasource V2") { +withTempDir { dir => + Seq("parquet", "orc", "csv", "json").foreach { format => +val basePath = dir.getCanonicalPath + "/" + format +val expected_plan_fragment = + s""" + |\\(1\\) BatchScan + |Output \\[2\\]: \\[value#x, id#x\\] + |DataFilters: \\[isnotnull\\(value#x\\), \\(value#x > 2\\)\\] + |Format: $format + |Location: InMemoryFileIndex\\[.*\\] + |PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\] + |PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\] + |ReadSchema: struct\\ + |""".stripMargin.trim + Review comment: ``` Seq(("parquet", "\\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]"), ("orc", "\\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]"), ("csv", "\\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]"), ("json", "")).foreach { (format, pushedFilters) => val expected_plan_fragment = s""" |\\(1\\) BatchScan |Output \\[2\\]: \\[value#x, id#x\\] |DataFilters: \\[isnotnull\\(value#x\\), \\(value#x > 2\\)\\] |Format: $format |Location: InMemoryFileIndex\\[.*\\] |PartitionFilters: \\[isnotnull\\(id#x\\), \\(id#x > 1\\)\\] |PushedFilers: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\] ${if (pushedFilters.nonEmpty) "|PushedFilers..."} |ReadSchema: struct\\ |""".stripMargin.trim ```
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.
HyukjinKwon commented on a change in pull request #28194: URL: https://github.com/apache/spark/pull/28194#discussion_r418402122 ## File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md ## @@ -0,0 +1,341 @@ + Review comment: Why is this file md specifically?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
AmplabJenkins removed a comment on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622232100
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.
HyukjinKwon commented on a change in pull request #28194: URL: https://github.com/apache/spark/pull/28194#discussion_r418401998 ## File path: sql/core/src/test/resources/sql-functions/sql-expression-schema.md ## @@ -0,0 +1,341 @@ + +## Summary Review comment: Hey @cloud-fan and @gatorsmile, let's not add a generated file into the main repo if this is for documentation purposes.
[GitHub] [spark] maropu commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
maropu commented on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622232046 Thanks for the quick response, @HyukjinKwon ;)
[GitHub] [spark] AmplabJenkins commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
AmplabJenkins commented on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622232100
[GitHub] [spark] SparkQA commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
SparkQA commented on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622231819 **[Test build #122155 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122155/testReport)** for PR 28427 at commit [`948a54e`](https://github.com/apache/spark/commit/948a54e7059b29a1a001315b0ed2d69ec4104314).
[GitHub] [spark] maropu commented on a change in pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.
maropu commented on a change in pull request #28194: URL: https://github.com/apache/spark/pull/28194#discussion_r418401753

## File path: sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala

@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.File
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.sql.catalyst.util.{fileToString, stringToFile}
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.tags.ExtendedSQLTest
+
+// scalastyle:off line.size.limit
+/**
+ * End-to-end test cases for SQL schemas of expression examples.
+ * The golden result file is "spark/sql/core/src/test/resources/sql-functions/sql-expression-schema.md".
+ *
+ * To run the entire test suite:
+ * {{{
+ *   build/sbt "sql/test-only *ExpressionsSchemaSuite"
+ * }}}
+ *
+ * To re-generate golden files for the entire suite, run:
+ * {{{
+ *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *ExpressionsSchemaSuite"
+ * }}}
+ *
+ * For example:
+ * {{{
+ *   ...
+ *   @ExpressionDescription(
+ *     usage = "_FUNC_(str, n) - Returns the string which repeats the given string value n times.",
+ *     examples = """
+ *       Examples:
+ *         > SELECT _FUNC_('123', 2);
+ *           123123
+ *     """,
+ *     since = "1.5.0")
+ *   case class StringRepeat(str: Expression, times: Expression)
+ *   ...
+ * }}}
+ *
+ * The format for golden result files looks roughly like:
+ * {{{
+ *   ...
+ *   | org.apache.spark.sql.catalyst.expressions.StringRepeat | repeat | SELECT repeat('123', 2) | struct |
+ *   ...
+ * }}}
+ */
+// scalastyle:on line.size.limit
+@ExtendedSQLTest
+class ExpressionsSchemaSuite extends QueryTest with SharedSparkSession {
+
+  private val regenerateGoldenFiles: Boolean = System.getenv("SPARK_GENERATE_GOLDEN_FILES") == "1"
+
+  private val baseResourcePath = {
+    // We use a path based on Spark home for 2 reasons:
+    //   1. Maven can't get the correct resource directory when resources are in other jars.
+    //   2. We test subclasses in the hive-thriftserver module.
+    val sparkHome = {
+      assert(sys.props.contains("spark.test.home") ||
+        sys.env.contains("SPARK_HOME"), "spark.test.home or SPARK_HOME is not set.")
+      sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME"))
+    }
+
+    java.nio.file.Paths.get(sparkHome,
+      "sql", "core", "src", "test", "resources", "sql-functions").toFile
+  }
+
+  private val resultFile = new File(baseResourcePath, "sql-expression-schema.md")
+
+  /** A single SQL query's SQL and schema. */
+  protected case class QueryOutput(
+      className: String,
+      funcName: String,
+      sql: String = "N/A",
+      schema: String = "N/A") {
+    override def toString: String = {
+      s"| $className | $funcName | $sql | $schema |"
+    }
+  }
+
+  test("Check schemas for expression examples") {
+    val exampleRe = """^(.+);\n(?s)(.+)$""".r
+    val funInfos = spark.sessionState.functionRegistry.listFunction().map { funcId =>
+      spark.sessionState.catalog.lookupFunctionInfo(funcId)
+    }
+
+    val classFunsMap = funInfos.groupBy(_.getClassName).toSeq.sortBy(_._1)
+    val outputBuffer = new ArrayBuffer[String]
+    val outputs = new ArrayBuffer[QueryOutput]
+    val missingExamples = new ArrayBuffer[String]
+
+    classFunsMap.foreach { kv =>
+      val className = kv._1
+      kv._2.foreach { funInfo =>
+        val example = funInfo.getExamples
+        val funcName = funInfo.getName.replaceAll("\\|", "")
+        if (example == "") {
+          val queryOutput = QueryOutput(className, funcName)
+          outputBuffer += queryOutput.toString
+          outputs += queryOutput
+          missingExamples += funcName
+        }
+
+        // If an expression has an 'Examples' segment, the first element is 'Examples'. Because
+        // this test case is only used to print aliases of expressions for double checking,
+        // we only need to output the first SQL and its corresponding schema.
+        // Note: We need to filter out the commands that set the parameters, such as:
+        // SET
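The regex in the quoted suite is what splits an expression's example text into its first SQL statement and the remaining output. A standalone sketch of that split (the example string here is hypothetical, shaped like a `@ExpressionDescription` example block):

```scala
// The suite's split regex: group 1 captures the SQL up to the ';' that ends
// the first line; "(?s)" turns on DOTALL from that point, so group 2 captures
// the rest of the block, newlines included. Scala's regex extractor requires
// a full-string match, which anchors the split the way the suite uses it.
val exampleRe = """^(.+);\n(?s)(.+)$""".r

// Hypothetical example text, like the StringRepeat example quoted above.
val example = "SELECT repeat('123', 2);\n  123123"

val (sqlText, rest) = example match {
  case exampleRe(sql, remainder) => (sql, remainder)
  case _ => ("N/A", "")  // the suite's default when no example is present
}
```

The extracted `sqlText` is what ends up in the `sql` column of the golden markdown table, and the schema column comes from actually running that statement.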
[GitHub] [spark] HyukjinKwon commented on pull request #28395: [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group
HyukjinKwon commented on pull request #28395: URL: https://github.com/apache/spark/pull/28395#issuecomment-622231583 Ah, it was already commented at https://github.com/apache/spark/pull/28194 :-)
[GitHub] [spark] HyukjinKwon commented on pull request #28395: [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group
HyukjinKwon commented on pull request #28395: URL: https://github.com/apache/spark/pull/28395#issuecomment-622231048 Thanks for letting me know. I will take a look too.
[GitHub] [spark] maropu commented on pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
maropu commented on pull request #28427: URL: https://github.com/apache/spark/pull/28427#issuecomment-622230995 cc: @dongjoon-hyun
[GitHub] [spark] maropu opened a new pull request #28427: [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite
maropu opened a new pull request #28427: URL: https://github.com/apache/spark/pull/28427

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] maropu commented on pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.
maropu commented on pull request #28194: URL: https://github.com/apache/spark/pull/28194#issuecomment-622230838 https://github.com/apache/spark/pull/28427
[GitHub] [spark] maropu commented on pull request #28194: [SPARK-31372][SQL][TEST] Display expression schema for double check.
maropu commented on pull request #28194: URL: https://github.com/apache/spark/pull/28194#issuecomment-622230902 To recover 3.0 asap, I opened a PR to fix it.
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28395: [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group
dongjoon-hyun edited a comment on pull request #28395: URL: https://github.com/apache/spark/pull/28395#issuecomment-69256 Thank you. The follow-up looks good. BTW, FYI, `branch-3.0` UT has been broken by another commit.
[GitHub] [spark] dongjoon-hyun commented on pull request #28395: [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group
dongjoon-hyun commented on pull request #28395: URL: https://github.com/apache/spark/pull/28395#issuecomment-69256 Thank you. The follow-up looks good. BTW, `branch-3.0` UT is broken now by another commit. :(