[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19884#discussion_r157683878 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala --- @@ -126,18 +121,14 @@ class ArrowPythonRunner( private var schema: StructType = _ private var vectors: Array[ColumnVector] = _ - private var closed = false - context.addTaskCompletionListener { _ => // todo: we need something like `reader.end()`, which release all the resources, but leave --- End diff -- @ueshin is this still a valid todo? It is possible in Arrow now by calling `reader.close(false)` where the `false` flag will free resources but keep the stream open. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19884#discussion_r157683384 --- Diff: python/pyspark/sql/types.py --- @@ -1679,6 +1678,15 @@ def from_arrow_schema(arrow_schema): for field in arrow_schema]) +def _require_minimum_pyarrow_version(): --- End diff -- @ueshin @HyukjinKwon just want to verify that we want to check for a minimum version of pyarrow and this is the right place to put this function? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20014: [SPARK-22827][CORE] Avoid throwing OutOfMemoryErr...
Github user sitalkedia commented on a diff in the pull request: https://github.com/apache/spark/pull/20014#discussion_r157682902 --- Diff: core/src/main/java/org/apache/spark/memory/SparkOutOfMemoryError.java --- @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.memory; + +/** + * This exception is thrown when a task can not acquire memory from the Memory manager. + * Instead of throwing {@link OutOfMemoryError}, which kills the executor, + * we should use throw this exception, which will just kill the current task. + */ +public final class SparkOutOfMemoryError extends OutOfMemoryError { --- End diff -- Yes, this is an internal class. How do you suggest to label it ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19884 **[Test build #85099 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85099/testReport)** for PR 19884 at commit [`22c6b92`](https://github.com/apache/spark/commit/22c6b92fc6ab31c332715c9372cfc1fe27835ccb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20016 + @holdenk @sameeragarwal --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20016 @HyukjinKwon I agree with you ! But since Spark is scala project - A lot's of developers refer examples available here and if those are Java Developers they might don't understand that this is right way to do in Scala ! I had a same discussion in Scala Days with developers and I think it does make sense here too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20015: [SPARK-22829] Add new built-in function date_trun...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/20015#discussion_r157676669 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1295,87 +1295,184 @@ case class ParseToTimestamp(left: Expression, format: Option[Expression], child: override def dataType: DataType = TimestampType } -/** - * Returns date truncated to the unit specified by the format. - */ -// scalastyle:off line.size.limit -@ExpressionDescription( - usage = "_FUNC_(date, fmt) - Returns `date` with the time portion of the day truncated to the unit specified by the format model `fmt`.", - examples = """ -Examples: - > SELECT _FUNC_('2009-02-12', 'MM'); - 2009-02-01 - > SELECT _FUNC_('2015-10-27', 'YEAR'); - 2015-01-01 - """, - since = "1.5.0") -// scalastyle:on line.size.limit -case class TruncDate(date: Expression, format: Expression) - extends BinaryExpression with ImplicitCastInputTypes { - override def left: Expression = date - override def right: Expression = format - - override def inputTypes: Seq[AbstractDataType] = Seq(DateType, StringType) - override def dataType: DataType = DateType +trait TruncTime extends BinaryExpression with ImplicitCastInputTypes { + val time: Expression + val format: Expression override def nullable: Boolean = true - override def prettyName: String = "trunc" private lazy val truncLevel: Int = DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String]) - override def eval(input: InternalRow): Any = { + /** + * + * @param input + * @param maxLevel Maximum level that can be used for truncation (e.g MONTH for Date input) + * @param truncFunc + * @tparam T + * @return + */ + protected def evalHelper[T](input: InternalRow, maxLevel: Int)( +truncFunc: (Any, Int) => T): Any = { val level = if (format.foldable) { truncLevel } else { DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String]) } -if (level == -1) { +if (level == DateTimeUtils.TRUNC_INVALID || level > maxLevel) { --- End diff -- `// unknown format or too small level`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20015: [SPARK-22829] Add new built-in function date_trun...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/20015#discussion_r157678588 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1295,87 +1295,184 @@ case class ParseToTimestamp(left: Expression, format: Option[Expression], child: override def dataType: DataType = TimestampType } -/** - * Returns date truncated to the unit specified by the format. - */ -// scalastyle:off line.size.limit -@ExpressionDescription( - usage = "_FUNC_(date, fmt) - Returns `date` with the time portion of the day truncated to the unit specified by the format model `fmt`.", - examples = """ -Examples: - > SELECT _FUNC_('2009-02-12', 'MM'); - 2009-02-01 - > SELECT _FUNC_('2015-10-27', 'YEAR'); - 2015-01-01 - """, - since = "1.5.0") -// scalastyle:on line.size.limit -case class TruncDate(date: Expression, format: Expression) - extends BinaryExpression with ImplicitCastInputTypes { - override def left: Expression = date - override def right: Expression = format - - override def inputTypes: Seq[AbstractDataType] = Seq(DateType, StringType) - override def dataType: DataType = DateType +trait TruncTime extends BinaryExpression with ImplicitCastInputTypes { + val time: Expression + val format: Expression override def nullable: Boolean = true - override def prettyName: String = "trunc" private lazy val truncLevel: Int = DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String]) - override def eval(input: InternalRow): Any = { + /** + * + * @param input + * @param maxLevel Maximum level that can be used for truncation (e.g MONTH for Date input) + * @param truncFunc + * @tparam T + * @return + */ + protected def evalHelper[T](input: InternalRow, maxLevel: Int)( +truncFunc: (Any, Int) => T): Any = { val level = if (format.foldable) { truncLevel } else { DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String]) } -if (level == -1) { +if (level == DateTimeUtils.TRUNC_INVALID || level > maxLevel) { // unknown format null } else { - val d = date.eval(input) + val d = time.eval(input) if (d == null) { null } else { -DateTimeUtils.truncDate(d.asInstanceOf[Int], level) +truncFunc(d, level) } } } - override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { + protected def codeGenHelper[T]( --- End diff -- Why do we need a type parameter `T`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20015: [SPARK-22829] Add new built-in function date_trun...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/20015#discussion_r157680290 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -944,9 +954,16 @@ object DateTimeUtils { date + daysToMonthEnd } - private val TRUNC_TO_YEAR = 1 - private val TRUNC_TO_MONTH = 2 - private val TRUNC_INVALID = -1 + // Visible for testing. + val TRUNC_TO_YEAR = 1 + val TRUNC_TO_MONTH = 2 + val TRUNC_TO_DAY = 3 + val TRUNC_TO_HOUR = 4 + val TRUNC_TO_MINUTE = 5 + val TRUNC_TO_SECOND = 6 + val TRUNC_TO_WEEK = 7 + val TRUNC_TO_QUARTER = 8 + val TRUNC_INVALID = -1 --- End diff -- Can we bring quarter and week forward, maybe to 3 and 4? Then it's more conform to the order of time granularity and max-level design is not influenced. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20015: [SPARK-22829] Add new built-in function date_trun...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/20015#discussion_r157674840 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1295,87 +1295,184 @@ case class ParseToTimestamp(left: Expression, format: Option[Expression], child: override def dataType: DataType = TimestampType } -/** - * Returns date truncated to the unit specified by the format. - */ -// scalastyle:off line.size.limit -@ExpressionDescription( - usage = "_FUNC_(date, fmt) - Returns `date` with the time portion of the day truncated to the unit specified by the format model `fmt`.", - examples = """ -Examples: - > SELECT _FUNC_('2009-02-12', 'MM'); - 2009-02-01 - > SELECT _FUNC_('2015-10-27', 'YEAR'); - 2015-01-01 - """, - since = "1.5.0") -// scalastyle:on line.size.limit -case class TruncDate(date: Expression, format: Expression) - extends BinaryExpression with ImplicitCastInputTypes { - override def left: Expression = date - override def right: Expression = format - - override def inputTypes: Seq[AbstractDataType] = Seq(DateType, StringType) - override def dataType: DataType = DateType +trait TruncTime extends BinaryExpression with ImplicitCastInputTypes { + val time: Expression + val format: Expression override def nullable: Boolean = true - override def prettyName: String = "trunc" private lazy val truncLevel: Int = DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String]) - override def eval(input: InternalRow): Any = { + /** + * + * @param input + * @param maxLevel Maximum level that can be used for truncation (e.g MONTH for Date input) + * @param truncFunc + * @tparam T + * @return + */ + protected def evalHelper[T](input: InternalRow, maxLevel: Int)( +truncFunc: (Any, Int) => T): Any = { val level = if (format.foldable) { truncLevel } else { DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String]) } -if (level == -1) { +if (level == DateTimeUtils.TRUNC_INVALID || level > maxLevel) { // unknown format null } else { - val d = date.eval(input) + val d = time.eval(input) --- End diff -- nit: Since this is a time, it can be `val t = ...` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20016 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
GitHub user chetkhatri opened a pull request: https://github.com/apache/spark/pull/20016 SPARK-22830 Scala Coding style has been improved in Spark Examples ## What changes were proposed in this pull request? * Under Spark Scala Examples: Some of the syntax were written like Java way, It has been re-written as per scala style guide. * Most of all changes are followed to println() statement. ## How was this patch tested? Since, All changes proposed are re-writing println statements in scala way, manual run used to test println. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chetkhatri/spark scala-style-spark-examples Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20016.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20016 commit 4ac1cb1c2aa6f72eee339e8b8b647647e879d91f Author: chetkhatriDate: 2017-12-19T07:17:37Z SPARK-22830 Scala Coding style has been improved in Spark Examples --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20015: [SPARK-22829] Add new built-in function date_trun...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20015#discussion_r157673626 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1295,87 +1295,184 @@ case class ParseToTimestamp(left: Expression, format: Option[Expression], child: override def dataType: DataType = TimestampType } -/** - * Returns date truncated to the unit specified by the format. - */ -// scalastyle:off line.size.limit -@ExpressionDescription( - usage = "_FUNC_(date, fmt) - Returns `date` with the time portion of the day truncated to the unit specified by the format model `fmt`.", - examples = """ -Examples: - > SELECT _FUNC_('2009-02-12', 'MM'); - 2009-02-01 - > SELECT _FUNC_('2015-10-27', 'YEAR'); - 2015-01-01 - """, - since = "1.5.0") -// scalastyle:on line.size.limit -case class TruncDate(date: Expression, format: Expression) - extends BinaryExpression with ImplicitCastInputTypes { - override def left: Expression = date - override def right: Expression = format - - override def inputTypes: Seq[AbstractDataType] = Seq(DateType, StringType) - override def dataType: DataType = DateType +trait TruncTime extends BinaryExpression with ImplicitCastInputTypes { + val time: Expression + val format: Expression override def nullable: Boolean = true - override def prettyName: String = "trunc" private lazy val truncLevel: Int = DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String]) - override def eval(input: InternalRow): Any = { + /** + * + * @param input --- End diff -- Seems `input` and `truncFunc` descriptions missing. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20015: [SPARK-22829] Add new built-in function date_trun...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20015#discussion_r157677311 --- Diff: python/pyspark/sql/functions.py --- @@ -,6 +,24 @@ def trunc(date, format): return Column(sc._jvm.functions.trunc(_to_java_column(date), format)) +@since(2.3) +def date_trunc(format, timestamp): +""" +Returns timestamp truncated to the unit specified by the format. + +:param format: 'year', '', 'yy', 'month', 'mon', 'mm', +'DAY', 'DD', 'HOUR', 'MINUTE', 'SECOND', 'WEEK', 'QUARTER' --- End diff -- Could we make those lowercased too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19884#discussion_r157677769 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala --- @@ -86,21 +86,16 @@ private[sql] object ArrowConverters { val root = VectorSchemaRoot.create(arrowSchema, allocator) val arrowWriter = ArrowWriter.create(root) -var closed = false - context.addTaskCompletionListener { _ => - if (!closed) { -root.close() -allocator.close() - } + root.close() + allocator.close() --- End diff -- Yes thanks for the reminder, they are updated now also. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20015: [SPARK-22829] Add new built-in function date_trun...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20015#discussion_r157677136 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2797,6 +2797,21 @@ object functions { TruncDate(date.expr, Literal(format)) } + /** + * Returns timestamp truncated to the unit specified by the format. + * + * @param format: 'year', '', 'yy' for truncate by year, + * 'month', 'mon', 'mm' for truncate by month, --- End diff -- nit: one space each more. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20015: [SPARK-22829] Add new built-in function date_trun...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20015#discussion_r157677400 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2797,6 +2797,21 @@ object functions { TruncDate(date.expr, Literal(format)) } + /** + * Returns timestamp truncated to the unit specified by the format. + * + * @param format: 'year', '', 'yy' for truncate by year, + * 'month', 'mon', 'mm' for truncate by month, + * 'day', 'dd' for truncate by day, + * Other options are: second, minute, hour, week, month, quarter --- End diff -- Maybe, `'second', 'minute', 'hour', 'week', 'month' and 'quarter'` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20015: [SPARK-22829] Add new built-in function date_trun...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20015#discussion_r157673559 --- Diff: python/pyspark/sql/functions.py --- @@ -,6 +,24 @@ def trunc(date, format): return Column(sc._jvm.functions.trunc(_to_java_column(date), format)) +@since(2.3) +def date_trunc(format, timestamp): +""" +Returns timestamp truncated to the unit specified by the format. + +:param format: 'year', '', 'yy', 'month', 'mon', 'mm', +'DAY', 'DD', 'HOUR', 'MINUTE', 'SECOND', 'WEEK', 'QUARTER' + +>>> df = spark.createDataFrame([('1997-02-28',)], ['d']) --- End diff -- Can we use a timestamp string like `1997-02-28 05:02:11` to show the difference from `trunc` a bit more clearly? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20015: [SPARK-22829] Add new built-in function date_trun...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20015#discussion_r157675835 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1295,87 +1295,184 @@ case class ParseToTimestamp(left: Expression, format: Option[Expression], child: override def dataType: DataType = TimestampType } -/** - * Returns date truncated to the unit specified by the format. - */ -// scalastyle:off line.size.limit -@ExpressionDescription( - usage = "_FUNC_(date, fmt) - Returns `date` with the time portion of the day truncated to the unit specified by the format model `fmt`.", - examples = """ -Examples: - > SELECT _FUNC_('2009-02-12', 'MM'); - 2009-02-01 - > SELECT _FUNC_('2015-10-27', 'YEAR'); - 2015-01-01 - """, - since = "1.5.0") -// scalastyle:on line.size.limit -case class TruncDate(date: Expression, format: Expression) - extends BinaryExpression with ImplicitCastInputTypes { - override def left: Expression = date - override def right: Expression = format - - override def inputTypes: Seq[AbstractDataType] = Seq(DateType, StringType) - override def dataType: DataType = DateType +trait TruncTime extends BinaryExpression with ImplicitCastInputTypes { --- End diff -- Maybe `TruncInstant`? I received this advice before and I liked it too. Not a big deal tho. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape
Github user ep1804 commented on the issue: https://github.com/apache/spark/pull/20004 `CSVOptions` is changed NOT to set `charToEscapeQuoteEscaping` to `\u` by default, to allow the uniVocity parser to use `escape` as `charToEscapeQuoteEscaping` character. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19975 **[Test build #85098 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85098/testReport)** for PR 19975 at commit [`29094cc`](https://github.com/apache/spark/commit/29094cc79d0025989df34061832b908880c64323). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape
Github user ep1804 commented on the issue: https://github.com/apache/spark/pull/20004 When `charToEscapeQuoteEscaping` is not set and `quote` and `escape` are different, uniVocity parser uses `escape` character as `charToEscapeQuoteEscaping` by default. This is why the test passes. Ref: https://github.com/uniVocity/univocity-parsers/blob/v2.5.9/src/main/java/com/univocity/parsers/csv/CsvFormat.java#L149-L157 I changed the test case so that the text to be parsed uses a `charToEscapeQuoteEscaping` which is different from its `escape` character. Though this case seems to be scarce, it will be why the uniVocity parser has this option exposed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19975 Retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19984 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85089/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19984 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19984 **[Test build #85089 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85089/testReport)** for PR 19984 at commit [`359ebdd`](https://github.com/apache/spark/commit/359ebdd8bdac0b93aa6b88beab0212393f1e2577). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20004 **[Test build #85097 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85097/testReport)** for PR 20004 at commit [`9680a4e`](https://github.com/apache/spark/commit/9680a4ed2613d2d7ef02afb52030cd90f75deb88). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20004 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20004 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85096/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20004 **[Test build #85096 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85096/testReport)** for PR 20004 at commit [`50dc2c0`](https://github.com/apache/spark/commit/50dc2c0983540c3d25d2970dbd7c1cb6d973f5e6). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20004 **[Test build #85096 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85096/testReport)** for PR 20004 at commit [`50dc2c0`](https://github.com/apache/spark/commit/50dc2c0983540c3d25d2970dbd7c1cb6d973f5e6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule conflict caused b...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19149 **[Test build #85095 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85095/testReport)** for PR 19149 at commit [`9b6fe36`](https://github.com/apache/spark/commit/9b6fe3644c3fc760de9ddf33c16bc1733e7c971d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule conflict caused b...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19149 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20014: [SPARK-22827][CORE] Avoid throwing OutOfMemoryError in c...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20014 Overall change lgtm. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20014: [SPARK-22827][CORE] Avoid throwing OutOfMemoryErr...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/20014#discussion_r157673852 --- Diff: core/src/main/java/org/apache/spark/memory/SparkOutOfMemoryError.java --- @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.memory; + +/** + * This exception is thrown when a task can not acquire memory from the Memory manager. + * Instead of throwing {@link OutOfMemoryError}, which kills the executor, + * we should use throw this exception, which will just kill the current task. + */ +public final class SparkOutOfMemoryError extends OutOfMemoryError { --- End diff -- is this an internal class? if yes perhaps we should label it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19950: [SPARK-22450][Core][MLLib][FollowUp] safely register cla...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19950 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19950: [SPARK-22450][Core][MLLib][FollowUp] safely register cla...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19950 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85084/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19950: [SPARK-22450][Core][MLLib][FollowUp] safely register cla...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19950 **[Test build #85084 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85084/testReport)** for PR 19950 at commit [`daba630`](https://github.com/apache/spark/commit/daba630343cb1d7f3ad137a75aaffb2a29b99cb6). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class LabeledPointSuite extends SparkFunSuite ` * `class TreePointSuite extends SparkFunSuite ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19950: [SPARK-22450][Core][MLLib][FollowUp] safely register cla...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19950 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85085/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19950: [SPARK-22450][Core][MLLib][FollowUp] safely register cla...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19950 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19950: [SPARK-22450][Core][MLLib][FollowUp] safely register cla...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19950 **[Test build #85085 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85085/testReport)** for PR 19950 at commit [`6f51096`](https://github.com/apache/spark/commit/6f51096060645abeb0e9b9568bca6dbb6213b16a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20012: [SPARK-22824] Restore old offset for binary compatibilit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20012 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85088/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20012: [SPARK-22824] Restore old offset for binary compatibilit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20012 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20012: [SPARK-22824] Restore old offset for binary compatibilit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20012 **[Test build #85088 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85088/testReport)** for PR 20012 at commit [`97fd2f9`](https://github.com/apache/spark/commit/97fd2f968d417bd7e690cf961efaac77bc2ef08d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19746: [SPARK-22346][ML] VectorSizeHint Transformer for ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19746#discussion_r157668450 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala --- @@ -0,0 +1,195 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.attribute.AttributeGroup +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, Param, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.{Column, DataFrame, Dataset} +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.StructType + +/** + * :: Experimental :: + * A feature transformer that adds size information to the metadata of a vector column. + * VectorAssembler needs size information for its input columns and cannot be used on streaming + * dataframes without this metadata. + * + */ +@Experimental +@Since("2.3.0") +class VectorSizeHint @Since("2.3.0") (@Since("2.3.0") override val uid: String) + extends Transformer with HasInputCol with HasHandleInvalid with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("vectSizeHint")) + + /** + * The size of Vectors in `inputCol`. + * @group param + */ + @Since("2.3.0") + val size: IntParam = new IntParam( +this, +"size", +"Size of vectors in column.", +{s: Int => s >= 0}) + + /** group getParam */ + @Since("2.3.0") + def getSize: Int = getOrDefault(size) + + /** @group setParam */ + @Since("2.3.0") + def setSize(value: Int): this.type = set(size, value) + + /** @group setParam */ + @Since("2.3.0") + def setInputCol(value: String): this.type = set(inputCol, value) + + /** + * Param for how to handle invalid entries. Invalid vectors include nulls and vectors with the + * wrong size. The options are `skip` (filter out rows with invalid vectors), `error` (throw an + * error) and `optimistic` (do not check the vector size, and keep all row\). `error` by default. --- End diff -- "row\" ==> "rows" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19985: [SPARK-22791] [SQL] [SS] Redact Output of Explain
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19985#discussion_r157667492 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala --- @@ -52,4 +53,40 @@ class DataSourceScanExecRedactionSuite extends QueryTest with SharedSQLContext { assert(df.queryExecution.simpleString.contains(replacement)) } } + + private def isIncluded(queryExecution: QueryExecution, msg: String): Boolean = { +queryExecution.toString.contains(msg) || +queryExecution.simpleString.contains(msg) || +queryExecution.stringWithStats.contains(msg) + } + + test("explain is redacted using SQLConf") { +withTempDir { dir => + val basePath = dir.getCanonicalPath + spark.range(0, 10).toDF("a").write.parquet(new Path(basePath, "foo=1").toString) + val df = spark.read.parquet(basePath) + val replacement = "*" + + // Respect SparkConf and replace file:/ + assert(isIncluded(df.queryExecution, replacement)) + + assert(isIncluded(df.queryExecution, "FileScan")) + assert(!isIncluded(df.queryExecution, "file:/")) + + withSQLConf(SQLConf.SQL_STRING_REDACTION_PATTERN.key -> "(?i)FileScan") { +// Respect SQLConf and replace FileScan +assert(isIncluded(df.queryExecution, replacement)) + +assert(!isIncluded(df.queryExecution, "FileScan")) +assert(isIncluded(df.queryExecution, "file:/")) + } + + // Respect SparkConf and replace file:/ + assert(isIncluded(df.queryExecution, replacement)) + + assert(isIncluded(df.queryExecution, "FileScan")) + assert(!isIncluded(df.queryExecution, "file:/")) --- End diff -- Removed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20015: [SPARK-22829] Add new built-in function date_trunc()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20015 OK. I am fine if you all guys strongly feel about this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20015: [SPARK-22829] Add new built-in function date_trunc()
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20015 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85083/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20015: [SPARK-22829] Add new built-in function date_trunc()
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20015 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20015: [SPARK-22829] Add new built-in function date_trunc()
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20015 **[Test build #85083 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85083/testReport)** for PR 20015 at commit [`f94f401`](https://github.com/apache/spark/commit/f94f401bcfd765b21c3fb466041b42a605d6a814). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20015: [SPARK-22829] Add new built-in function date_trunc()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20015 > after having a discussion with @cloud-fan, @gatorsmile, @rednaxelafx and Reynold Where did the discussion happen? Was this offline discussion? I also want to actively join in the discussion. Many implementations of the trunc works differently and I think we decide the current behaviour after sufficient discussion. If we don't fix the stuff about #14788 in 2.3.0 timeline, it could be even more difficult because we need to keep the previous behaviour. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r157666054 --- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile --- @@ -0,0 +1,47 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +FROM openjdk:8-alpine --- End diff -- @foxish The libc6-compat did not fix our issue because libiomp5.so can not be loaded. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/19954 @vanzin @mccheah @foxish Variables names have been shortened in https://github.com/apache/spark/pull/19954/commits/d50c61e46c43049b9ede615ffabd89a55c4adb0c, and traits with only a single implementation have been removed in https://github.com/apache/spark/pull/19954/commits/5b82fc06a5fff2a3ff2062c46a21d069640eeec7. Regarding the documentation of the concepts around steps and orchestrators and how they fit together, I would suggest that we have a separate PR for the architecture docs. I also think we can refactor the orchestrator logic in a future PR to also cover executor pods so we use the same mechanism for both the driver and executor pods. WDYT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19954 **[Test build #85094 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85094/testReport)** for PR 19954 at commit [`5b82fc0`](https://github.com/apache/spark/commit/5b82fc06a5fff2a3ff2062c46a21d069640eeec7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19981: [SPARK-22786][SQL] only use AppStatusPlugin in history s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19981 **[Test build #85093 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85093/testReport)** for PR 19981 at commit [`5b64f88`](https://github.com/apache/spark/commit/5b64f881adcf34b958aa659e62a9ce93171cf109). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19984 **[Test build #85092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85092/testReport)** for PR 19984 at commit [`19f08a9`](https://github.com/apache/spark/commit/19f08a9c875c6e52bb75c82d196fb3a310311ffe). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19981: [SPARK-22786][SQL] only use AppStatusPlugin in hi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19981#discussion_r157663383 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala --- @@ -147,236 +159,246 @@ class SQLListenerSuite extends SparkFunSuite with SharedSQLContext with JsonTest (id, accumulatorValue) }.toMap -bus.postToAll(SparkListenerSQLExecutionStart( +listener.onOtherEvent(SparkListenerSQLExecutionStart( executionId, "test", "test", df.queryExecution.toString, SparkPlanInfo.fromSparkPlan(df.queryExecution.executedPlan), System.currentTimeMillis())) -bus.postToAll(SparkListenerJobStart( +listener.onJobStart(SparkListenerJobStart( jobId = 0, time = System.currentTimeMillis(), stageInfos = Seq( createStageInfo(0, 0), createStageInfo(1, 0) ), createProperties(executionId))) -bus.postToAll(SparkListenerStageSubmitted(createStageInfo(0, 0))) + listener.onStageSubmitted(SparkListenerStageSubmitted(createStageInfo(0, 0))) -assert(store.executionMetrics(0).isEmpty) +assert(statusStore.executionMetrics(executionId).isEmpty) -bus.postToAll(SparkListenerExecutorMetricsUpdate("", Seq( + listener.onExecutorMetricsUpdate(SparkListenerExecutorMetricsUpdate("", Seq( // (task id, stage id, stage attempt, accum updates) (0L, 0, 0, createAccumulatorInfos(accumulatorUpdates)), (1L, 0, 0, createAccumulatorInfos(accumulatorUpdates)) ))) -checkAnswer(store.executionMetrics(0), accumulatorUpdates.mapValues(_ * 2)) +checkAnswer(statusStore.executionMetrics(executionId), accumulatorUpdates.mapValues(_ * 2)) // Driver accumulator updates don't belong to this execution should be filtered and no // exception will be thrown. -bus.postToAll(SparkListenerDriverAccumUpdates(0, Seq((999L, 2L +listener.onOtherEvent(SparkListenerDriverAccumUpdates(0, Seq((999L, 2L -checkAnswer(store.executionMetrics(0), accumulatorUpdates.mapValues(_ * 2)) +checkAnswer(statusStore.executionMetrics(executionId), accumulatorUpdates.mapValues(_ * 2)) -bus.postToAll(SparkListenerExecutorMetricsUpdate("", Seq( + listener.onExecutorMetricsUpdate(SparkListenerExecutorMetricsUpdate("", Seq( // (task id, stage id, stage attempt, accum updates) (0L, 0, 0, createAccumulatorInfos(accumulatorUpdates)), (1L, 0, 0, createAccumulatorInfos(accumulatorUpdates.mapValues(_ * 2))) ))) -checkAnswer(store.executionMetrics(0), accumulatorUpdates.mapValues(_ * 3)) +checkAnswer(statusStore.executionMetrics(executionId), accumulatorUpdates.mapValues(_ * 3)) // Retrying a stage should reset the metrics -bus.postToAll(SparkListenerStageSubmitted(createStageInfo(0, 1))) + listener.onStageSubmitted(SparkListenerStageSubmitted(createStageInfo(0, 1))) -bus.postToAll(SparkListenerExecutorMetricsUpdate("", Seq( + listener.onExecutorMetricsUpdate(SparkListenerExecutorMetricsUpdate("", Seq( // (task id, stage id, stage attempt, accum updates) (0L, 0, 1, createAccumulatorInfos(accumulatorUpdates)), (1L, 0, 1, createAccumulatorInfos(accumulatorUpdates)) ))) -checkAnswer(store.executionMetrics(0), accumulatorUpdates.mapValues(_ * 2)) +checkAnswer(statusStore.executionMetrics(executionId), accumulatorUpdates.mapValues(_ * 2)) // Ignore the task end for the first attempt -bus.postToAll(SparkListenerTaskEnd( +listener.onTaskEnd(SparkListenerTaskEnd( stageId = 0, stageAttemptId = 0, taskType = "", reason = null, createTaskInfo(0, 0, accums = accumulatorUpdates.mapValues(_ * 100)), null)) -checkAnswer(store.executionMetrics(0), accumulatorUpdates.mapValues(_ * 2)) +checkAnswer(statusStore.executionMetrics(executionId), accumulatorUpdates.mapValues(_ * 2)) // Finish two tasks -bus.postToAll(SparkListenerTaskEnd( +listener.onTaskEnd(SparkListenerTaskEnd( stageId = 0, stageAttemptId = 1, taskType = "", reason = null, createTaskInfo(0, 0, accums = accumulatorUpdates.mapValues(_ * 2)), null)) -bus.postToAll(SparkListenerTaskEnd( +listener.onTaskEnd(SparkListenerTaskEnd( stageId = 0, stageAttemptId = 1, taskType = "", reason = null, createTaskInfo(1, 0, accums = accumulatorUpdates.mapValues(_ * 3)), null)) -
[GitHub] spark issue #19981: [SPARK-22786][SQL] only use AppStatusPlugin in history s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19981 **[Test build #85091 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85091/testReport)** for PR 19981 at commit [`60421ac`](https://github.com/apache/spark/commit/60421acf5c652731d64ffad013bd4c179d6401b5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19981: [SPARK-22786][SQL] only use AppStatusPlugin in hi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19981#discussion_r157663315 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala --- @@ -36,11 +36,12 @@ import org.apache.spark.sql.catalyst.util.quietly import org.apache.spark.sql.execution.{LeafExecNode, QueryExecution, SparkPlanInfo, SQLExecution} import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics} import org.apache.spark.sql.test.SharedSQLContext -import org.apache.spark.status.config._ +import org.apache.spark.status.config.LIVE_ENTITY_UPDATE_PERIOD import org.apache.spark.util.{AccumulatorMetadata, JsonProtocol, LongAccumulator} import org.apache.spark.util.kvstore.InMemoryStore -class SQLListenerSuite extends SparkFunSuite with SharedSQLContext with JsonTestUtils { + +class SQLAppStatusListenerSuite extends SparkFunSuite with SharedSQLContext with JsonTestUtils { --- End diff -- Then it's an existing problem of `SQLListenerSuite`, as previously it didn't only test `SQLListener`. This should not stop us from renaming it after we rename `SQLListener` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19985: [SPARK-22791] [SQL] [SS] Redact Output of Explain
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19985 **[Test build #85090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85090/testReport)** for PR 19985 at commit [`75c1479`](https://github.com/apache/spark/commit/75c1479d62045f3b8f987d38d88fd2e6be88e57b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19981: [SPARK-22786][SQL] only use AppStatusPlugin in hi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19981#discussion_r157662992 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -322,15 +321,18 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) (new InMemoryStore(), true) } +val plugins = ServiceLoader.load( + classOf[AppHistoryServerPlugin], Utils.getContextOrSparkClassLoader).asScala val trackingStore = new ElementTrackingStore(kvstore, conf) --- End diff -- do we really need to limit the UI data for history server? cc @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20015: [SPARK-22829] Add new built-in function date_trunc()
Github user youngbink commented on the issue: https://github.com/apache/spark/pull/20015 @HyukjinKwon Just took a look at this PR #14788. My point of mentioning those databases was just to give examples of the function that Spark doesn't support but other databases commonly do. (They all have this `date_trunc` which takes `timestamp` and output `timestamp`) As you said, we could extend `trunc` and simply create an alias `date_trunc`, but it's actually not as simple. For e.g, PR #14788 won't be able to handle the following command collectly on PySpark: ``` df = spark.createDataFrame([('1997-02-28 05:02:11',)], ['d']) df.select(functions.trunc(df.d, 'year').alias('year')).collect() df.select(functions.trunc(df.d, 'SS').alias('SS')).collect() ``` This is because `trunc(string, string)` isn't correctly handled. We could find a way around this and get it working, but after having a discussion with @cloud-fan, @gatorsmile, @rednaxelafx and Reynold, we decided to add `date_trunc` to be compatible with Postgres for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule conflict c...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19149#discussion_r157661943 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -47,7 +47,62 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) protected def fixedPoint = FixedPoint(SQLConf.get.optimizerMaxIterations) def batches: Seq[Batch] = { -Batch("Eliminate Distinct", Once, EliminateDistinct) :: +val operatorOptimizationRuleSet = + Seq( +// Operator push down +PushProjectionThroughUnion, +ReorderJoin, +EliminateOuterJoin, +PushPredicateThroughJoin, +PushDownPredicate, +LimitPushDown, +ColumnPruning, +InferFiltersFromConstraints, --- End diff -- Yeah --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20014: [SPARK-22827][CORE] Avoid throwing OutOfMemoryError in c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20014 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20014: [SPARK-22827][CORE] Avoid throwing OutOfMemoryError in c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20014 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85081/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20014: [SPARK-22827][CORE] Avoid throwing OutOfMemoryError in c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20014 **[Test build #85081 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85081/testReport)** for PR 20014 at commit [`4254d3e`](https://github.com/apache/spark/commit/4254d3e35bf6d8c34435c7d7456d18e4b86a7e59). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85086/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #85086 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85086/testReport)** for PR 19904 at commit [`ccd2689`](https://github.com/apache/spark/commit/ccd26895e749cce1acbe65a5507cba6b0e6c42d3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20000: [SPARK-22815] [SQL] Keep PromotePrecision in Optimized P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/2 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85082/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20000: [SPARK-22815] [SQL] Keep PromotePrecision in Optimized P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/2 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20000: [SPARK-22815] [SQL] Keep PromotePrecision in Optimized P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/2 **[Test build #85082 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85082/testReport)** for PR 2 at commit [`3f27c4b`](https://github.com/apache/spark/commit/3f27c4b839a352716c204d1580f17d6625f7bee4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19984 **[Test build #85089 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85089/testReport)** for PR 19984 at commit [`359ebdd`](https://github.com/apache/spark/commit/359ebdd8bdac0b93aa6b88beab0212393f1e2577). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19892: [SPARK-22797][PySpark] Bucketizer support multi-column
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19892 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19892: [SPARK-22797][PySpark] Bucketizer support multi-column
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19892 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85087/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19892: [SPARK-22797][PySpark] Bucketizer support multi-column
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19892 **[Test build #85087 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85087/testReport)** for PR 19892 at commit [`e1fb379`](https://github.com/apache/spark/commit/e1fb379ba710a7656d73f60c45c18dbf0761a98c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19988: [Spark-22795] [ML] Raise error when line search in First...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19988 Hm that's a good point. It means the optimization isn't sure it's making progress anymore because the input is too weird but I guess that doesn't say much about the actual quality of the solution --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20012: [SPARK-22824] Restore old offset for binary compatibilit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20012 **[Test build #85088 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85088/testReport)** for PR 20012 at commit [`97fd2f9`](https://github.com/apache/spark/commit/97fd2f968d417bd7e690cf961efaac77bc2ef08d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19988: [Spark-22795] [ML] Raise error when line search in First...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19988 @srowen Wait... @jkbradley seems to have more thoughts about this: Question: When line search failed, does it mean the model is always meaning-less ? Maybe we need more discussion. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19892: [SPARK-22797][PySpark] Bucketizer support multi-column
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19892 **[Test build #85087 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85087/testReport)** for PR 19892 at commit [`e1fb379`](https://github.com/apache/spark/commit/e1fb379ba710a7656d73f60c45c18dbf0761a98c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19984 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19984 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85079/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19984 **[Test build #85079 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85079/testReport)** for PR 19984 at commit [`2af9b40`](https://github.com/apache/spark/commit/2af9b40d60027397fd6e915413b88f12249245e5). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #85086 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85086/testReport)** for PR 19904 at commit [`ccd2689`](https://github.com/apache/spark/commit/ccd26895e749cce1acbe65a5507cba6b0e6c42d3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19864: [SPARK-22673][SQL] InMemoryRelation should utiliz...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/19864#discussion_r157653784 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala --- @@ -479,4 +485,43 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { } } } + + test("SPARK-22673: InMemoryRelation should utilize existing stats of the plan to be cached") { +withSQLConf("spark.sql.cbo.enabled" -> "true") { + withTempPath { workDir => +withTable("table1") { + val workDirPath = workDir.getAbsolutePath + val data = Seq(100, 200, 300, 400).toDF("count") + data.write.parquet(workDirPath) + val dfFromFile = spark.read.parquet(workDirPath).cache() + val inMemoryRelation = dfFromFile.queryExecution.optimizedPlan.collect { +case plan: InMemoryRelation => plan + }.head + // InMemoryRelation's stats is file size before the underlying RDD is materialized + assert(inMemoryRelation.computeStats().sizeInBytes === 740) + + // InMemoryRelation's stats is updated after materializing RDD + dfFromFile.collect() + assert(inMemoryRelation.computeStats().sizeInBytes === 16) + + // test of catalog table + val dfFromTable = spark.catalog.createTable("table1", workDirPath).cache() + val inMemoryRelation2 = dfFromTable.queryExecution.optimizedPlan. +collect { case plan: InMemoryRelation => plan }.head + + // Even CBO enabled, InMemoryRelation's stats keeps as the file size before table's stats + // is calculated + assert(inMemoryRelation2.computeStats().sizeInBytes === 740) + + // InMemoryRelation's stats should be updated after calculating stats of the table + // clear cache to simulate a fresh environment + dfFromTable.unpersist(blocking = true) + spark.sql("ANALYZE TABLE table1 COMPUTE STATISTICS") + val inMemoryRelation3 = spark.read.table("table1").cache().queryExecution.optimizedPlan. +collect { case plan: InMemoryRelation => plan }.head + assert(inMemoryRelation3.computeStats().sizeInBytes === 48) --- End diff -- because 16 is the `exact` in-memory size which is got by reading the accumulator's value after evaluating the RDD 48 is calculated by EstimationUtils: https://github.com/apache/spark/blob/bdb5e55c2a67d16a36ad6baa22296d714d3525af/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala#L78 `(8 + 4 (average attribute length)) * 4` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19950: [SPARK-22450][Core][MLLib][FollowUp] safely regis...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19950#discussion_r157653695 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/TreePointSuite.scala --- @@ -0,0 +1,39 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import org.apache.spark.{SparkConf, SparkFunSuite} +import org.apache.spark.serializer.KryoSerializer + +class TreePointSuite extends SparkFunSuite{ + test("Kryo class register") { +val conf = new SparkConf(false) +conf.set("spark.kryo.registrationRequired", "true") + +val ser = new KryoSerializer(conf).newInstance() + +def check(p: TreePoint): Unit = { --- End diff -- I also think there is not much value to do this, although current testsuites are all like this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19350: [SPARK-22126][ML][WIP] Fix model-specific optimization s...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19350 Design changed. I will create new PR for this later. New design is here https://docs.google.com/document/d/1xw5M4sp1e0eQie75yIt-r6-GTuD5vpFf_I6v-AFBM3M/edit?usp=sharing --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19350: [SPARK-22126][ML][WIP] Fix model-specific optimiz...
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/19350 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19857: [SPARK-22667][ML][WIP] Fix model-specific optimization s...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19857 The design of this issue changed. @MrBago will take this over. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19857: [SPARK-22667][ML][WIP] Fix model-specific optimiz...
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/19857 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19950: [SPARK-22450][Core][MLLib][FollowUp] safely register cla...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19950 **[Test build #85085 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85085/testReport)** for PR 19950 at commit [`6f51096`](https://github.com/apache/spark/commit/6f51096060645abeb0e9b9568bca6dbb6213b16a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20015: [SPARK-22829] Add new built-in function date_trunc()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20015 @cloud-fan and @youngbink how about reviving https://github.com/apache/spark/pull/14788 with a configuration to control this? AWS Redshift seems having `TRUNC` which just converts a timestamp to a date whereas we have Spark's `trunc` where supports date formats. This is not quite equivalent. I think Spark's `trunc` is more like Redshift's `DATE_TRUNC`. PostgreSQL does not have `trunc` but has `date_trunc` where we can specify the format and returns a timestamp always. Presto also looks not having a duplicated functionality. I think we can simply introduce an alias for `trunc` after resolving https://github.com/apache/spark/pull/14788. Did I maybe miss something? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19950: [SPARK-22450][Core][MLLib][FollowUp] safely register cla...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19950 **[Test build #85084 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85084/testReport)** for PR 19950 at commit [`daba630`](https://github.com/apache/spark/commit/daba630343cb1d7f3ad137a75aaffb2a29b99cb6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19933: [SPARK-22744][CORE] Add a configuration to show the appl...
Github user LantaoJin commented on the issue: https://github.com/apache/spark/pull/19933 Thanks your time to discuss so much with me @srowen. Like MapReduce does, the submit host should be set automatically in submitting logical rather than expecting Spark application owner to set. Almost application owners don't care about it so they won't set it in configuration. But if it can be found in WebUI, it would be very helpful to platform/infra team. So the code I contributed `sparkConf.set("spark.submit.hostname", Utils.localHostName)` in SparkSubmit.scala has no side effect to application onwers. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19864: [SPARK-22673][SQL] InMemoryRelation should utiliz...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19864#discussion_r157652482 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala --- @@ -479,4 +485,43 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { } } } + + test("SPARK-22673: InMemoryRelation should utilize existing stats of the plan to be cached") { +withSQLConf("spark.sql.cbo.enabled" -> "true") { + withTempPath { workDir => +withTable("table1") { + val workDirPath = workDir.getAbsolutePath + val data = Seq(100, 200, 300, 400).toDF("count") + data.write.parquet(workDirPath) + val dfFromFile = spark.read.parquet(workDirPath).cache() + val inMemoryRelation = dfFromFile.queryExecution.optimizedPlan.collect { +case plan: InMemoryRelation => plan + }.head + // InMemoryRelation's stats is file size before the underlying RDD is materialized + assert(inMemoryRelation.computeStats().sizeInBytes === 740) + + // InMemoryRelation's stats is updated after materializing RDD + dfFromFile.collect() + assert(inMemoryRelation.computeStats().sizeInBytes === 16) + + // test of catalog table + val dfFromTable = spark.catalog.createTable("table1", workDirPath).cache() + val inMemoryRelation2 = dfFromTable.queryExecution.optimizedPlan. +collect { case plan: InMemoryRelation => plan }.head + + // Even CBO enabled, InMemoryRelation's stats keeps as the file size before table's stats + // is calculated + assert(inMemoryRelation2.computeStats().sizeInBytes === 740) + + // InMemoryRelation's stats should be updated after calculating stats of the table + // clear cache to simulate a fresh environment + dfFromTable.unpersist(blocking = true) + spark.sql("ANALYZE TABLE table1 COMPUTE STATISTICS") + val inMemoryRelation3 = spark.read.table("table1").cache().queryExecution.optimizedPlan. +collect { case plan: InMemoryRelation => plan }.head + assert(inMemoryRelation3.computeStats().sizeInBytes === 48) --- End diff -- missed this one, why does it have a different stats than the table cache stats `16`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19985: [SPARK-22791] [SQL] [SS] Redact Output of Explain
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19985#discussion_r157652242 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala --- @@ -52,4 +53,40 @@ class DataSourceScanExecRedactionSuite extends QueryTest with SharedSQLContext { assert(df.queryExecution.simpleString.contains(replacement)) } } + + private def isIncluded(queryExecution: QueryExecution, msg: String): Boolean = { +queryExecution.toString.contains(msg) || +queryExecution.simpleString.contains(msg) || +queryExecution.stringWithStats.contains(msg) + } + + test("explain is redacted using SQLConf") { +withTempDir { dir => + val basePath = dir.getCanonicalPath + spark.range(0, 10).toDF("a").write.parquet(new Path(basePath, "foo=1").toString) + val df = spark.read.parquet(basePath) + val replacement = "*" + + // Respect SparkConf and replace file:/ + assert(isIncluded(df.queryExecution, replacement)) + + assert(isIncluded(df.queryExecution, "FileScan")) + assert(!isIncluded(df.queryExecution, "file:/")) + + withSQLConf(SQLConf.SQL_STRING_REDACTION_PATTERN.key -> "(?i)FileScan") { +// Respect SQLConf and replace FileScan +assert(isIncluded(df.queryExecution, replacement)) + +assert(!isIncluded(df.queryExecution, "FileScan")) +assert(isIncluded(df.queryExecution, "file:/")) + } + + // Respect SparkConf and replace file:/ + assert(isIncluded(df.queryExecution, replacement)) + + assert(isIncluded(df.queryExecution, "FileScan")) + assert(!isIncluded(df.queryExecution, "file:/")) --- End diff -- what the difference between these 3 lines and https://github.com/apache/spark/pull/19985/files#diff-0c515221ed6e6eadcec71b3b9ad3a3e1R70 ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19985: [SPARK-22791] [SQL] [SS] Redact Output of Explain
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19985#discussion_r157652152 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala --- @@ -231,6 +231,13 @@ class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) { """.stripMargin.trim } + /** + * Redact the sensitive information in the given string. + */ + private def withRedaction(message: => String): String = { --- End diff -- `=> String` looks not very useful here, we need to materialize anyway when calling `Utils.redact` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19985: [SPARK-22791] [SQL] [SS] Redact Output of Explain
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19985 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85077/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19985: [SPARK-22791] [SQL] [SS] Redact Output of Explain
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19985 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20008: [SPARK-22822][TEST] Basic tests for FunctionArgum...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20008#discussion_r157651436 --- Diff: sql/core/src/test/resources/sql-tests/inputs/typeCoercion/native/decimalPrecision.sql --- @@ -0,0 +1,6883 @@ +-- +-- Licensed to the Apache Software Foundation (ASF) under one or more +-- contributor license agreements. See the NOTICE file distributed with +-- this work for additional information regarding copyright ownership. +-- The ASF licenses this file to You under the Apache License, Version 2.0 +-- (the "License"); you may not use this file except in compliance with +-- the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, software +-- distributed under the License is distributed on an "AS IS" BASIS, +-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +-- See the License for the specific language governing permissions and +-- limitations under the License. +-- + +CREATE TEMPORARY VIEW t AS SELECT 1; + +SELECT cast(1 as tinyint) + cast(1 as decimal(1, 0)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(3, 0)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(4, 0)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(5, 0)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(6, 0)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(10, 0)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(11, 0)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(20, 0)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(21, 0)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(38, 0)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(39, 0)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(1, 1)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(2, 1)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(3, 1)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(4, 1)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(5, 1)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(6, 1)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(10, 1)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(11, 1)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(20, 1)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(21, 1)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(38, 1)) FROM t; +SELECT cast(1 as tinyint) + cast(1 as decimal(39, 1)) FROM t; --- End diff -- How about only these 4 decimals: `DECIMAL(3, 0)`, `DECIMAL(5, 0)`, `DECIMAL(10, 0)` and `DECIMAL(20, 0)`. https://github.com/apache/spark/blob/00d176d2fe7bbdf55cb3146a9cb04ca99b1858b7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala#L54-L57 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org