[GitHub] [spark] ScrapCodes commented on pull request #29334: [SPARK-32495][2.4] Update jackson versions to a maintained release, to fix various security vulnerabilities.
ScrapCodes commented on pull request #29334: URL: https://github.com/apache/spark/pull/29334#issuecomment-678919877

Thank you @cowtowncoder, @srowen and @Fokko. Indeed, the security vulnerabilities only raise false alarms and do not apply to Spark itself. However, if a client application depends on Spark and also uses jackson-databind, it has to deal with those security issues on its own. The best thing to do is to upgrade to 3.0, but that is difficult for folks who have only recently upgraded to Spark 2.4.x, which is also why we still maintain the 2.4.x release line. A lot of great suggestions have come in; shading the jar comes with its own set of complexities. I am not absolutely sure, but if we cannot upgrade as is, I'd suggest we reconsider this later. Thanks again, everyone, for chiming in and providing valuable suggestions.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
AmplabJenkins removed a comment on pull request #29421: URL: https://github.com/apache/spark/pull/29421#issuecomment-678918248
[GitHub] [spark] AmplabJenkins commented on pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
AmplabJenkins commented on pull request #29421: URL: https://github.com/apache/spark/pull/29421#issuecomment-678918248
[GitHub] [spark] maropu commented on a change in pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
maropu commented on a change in pull request #29421: URL: https://github.com/apache/spark/pull/29421#discussion_r475356961

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala ##
@@ -182,7 +182,11 @@ class HiveScriptTransformationSuite extends BaseScriptTransformationSuite with T
         identity,
         df.select(
           'a.cast("string").as("key"),
-          'b.cast("string").as("value")).collect())
+          concat_ws("\t",
+            'b.cast("string"),
+            'c.cast("string"),
+            'd.cast("string"),
+            'e.cast("string")).as("value")).collect())

Review comment: Oh, I see. In that case, we should return NULL.
[GitHub] [spark] SparkQA commented on pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
SparkQA commented on pull request #29421: URL: https://github.com/apache/spark/pull/29421#issuecomment-678917855 **[Test build #127830 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127830/testReport)** for PR 29421 at commit [`5f03222`](https://github.com/apache/spark/commit/5f032229ca2c457753622e21e22d92848de24fa6).
[GitHub] [spark] maropu commented on pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
maropu commented on pull request #29421: URL: https://github.com/apache/spark/pull/29421#issuecomment-678916433 retest this please
[GitHub] [spark] maropu commented on a change in pull request #29485: [SPARK-32638][SQL] Corrects references when adding aliases in WidenSetOperationTypes
maropu commented on a change in pull request #29485: URL: https://github.com/apache/spark/pull/29485#discussion_r475352241

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ##
@@ -328,27 +328,46 @@ object TypeCoercion {
    */
   object WidenSetOperationTypes extends Rule[LogicalPlan] {

-    def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp {
-      case s @ Except(left, right, isAll) if s.childrenResolved &&
-        left.output.length == right.output.length && !s.resolved =>
-        val newChildren: Seq[LogicalPlan] = buildNewChildrenWithWiderTypes(left :: right :: Nil)
-        assert(newChildren.length == 2)
-        Except(newChildren.head, newChildren.last, isAll)
-
-      case s @ Intersect(left, right, isAll) if s.childrenResolved &&
-        left.output.length == right.output.length && !s.resolved =>
-        val newChildren: Seq[LogicalPlan] = buildNewChildrenWithWiderTypes(left :: right :: Nil)
-        assert(newChildren.length == 2)
-        Intersect(newChildren.head, newChildren.last, isAll)
-
-      case s: Union if s.childrenResolved && !s.byName &&
+    def apply(plan: LogicalPlan): LogicalPlan = {
+      val exprIdMapArray = mutable.ArrayBuffer[(ExprId, Attribute)]()
+      val newPlan = plan resolveOperatorsUp {
+        case s @ Except(left, right, isAll) if s.childrenResolved &&
+            left.output.length == right.output.length && !s.resolved =>
+          val (newChildren, newExprIds) = buildNewChildrenWithWiderTypes(left :: right :: Nil)
+          exprIdMapArray ++= newExprIds
+          assert(newChildren.length == 2)
+          Except(newChildren.head, newChildren.last, isAll)
+
+        case s @ Intersect(left, right, isAll) if s.childrenResolved &&
+            left.output.length == right.output.length && !s.resolved =>
+          val (newChildren, newExprIds) = buildNewChildrenWithWiderTypes(left :: right :: Nil)
+          exprIdMapArray ++= newExprIds
+          assert(newChildren.length == 2)
+          Intersect(newChildren.head, newChildren.last, isAll)
+
+        case s: Union if s.childrenResolved && !s.byName &&
             s.children.forall(_.output.length == s.children.head.output.length) && !s.resolved =>
-        val newChildren: Seq[LogicalPlan] = buildNewChildrenWithWiderTypes(s.children)
-        s.copy(children = newChildren)
+          val (newChildren, newExprIds) = buildNewChildrenWithWiderTypes(s.children)
+          exprIdMapArray ++= newExprIds
+          s.copy(children = newChildren)
+      }
+
+      // Re-maps existing references to the new ones (exprId and dataType)
+      // for aliases added when widening columns' data types.

Review comment: Yea, I tried it first, but `RemoveNoopOperators` will remove a `Project` with a rewritten alias
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L480
Because it assumes projects having common exprIds have the same output. There may be a way to avoid the case and I'll check `TimeWindowing`.
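The hazard described in the review comment above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Catalyst code: `Attr`, `ToyPlan`, `ToyRelation`, `ToyProject` and `removeNoopProject` are hypothetical names, and the only behaviour modelled is the one the comment describes, namely that a project whose output carries the same exprIds as its child is treated as a no-op.

```scala
// Toy model only; none of these types are Spark's.
object NoopProjectSketch {
  final case class Attr(name: String, exprId: Long, dataType: String)

  sealed trait ToyPlan { def output: Seq[Attr] }
  final case class ToyRelation(output: Seq[Attr]) extends ToyPlan
  final case class ToyProject(output: Seq[Attr], child: ToyPlan) extends ToyPlan

  // Assumption taken from the comment: outputs are compared by exprId alone,
  // so a changed data type is invisible to this check.
  def sameOutput(a: Seq[Attr], b: Seq[Attr]): Boolean =
    a.length == b.length && a.zip(b).forall { case (x, y) => x.exprId == y.exprId }

  // Drops every project whose output exprIds match those of its child.
  def removeNoopProject(plan: ToyPlan): ToyPlan = plan match {
    case ToyProject(out, child) if sameOutput(out, child.output) => removeNoopProject(child)
    case ToyProject(out, child) => ToyProject(out, removeNoopProject(child))
    case other => other
  }

  val relation: ToyRelation = ToyRelation(Seq(Attr("v", 1L, "int")))
  // A widening project that re-aliases v while keeping the existing exprId 1.
  val widenedSameId: ToyProject = ToyProject(Seq(Attr("v", 1L, "decimal(11,0)")), relation)
  // The same widening with a fresh exprId 3.
  val widenedFreshId: ToyProject = ToyProject(Seq(Attr("v", 3L, "decimal(11,0)")), relation)
}
```

In this model `removeNoopProject(widenedSameId)` returns the bare relation, so the widening is silently lost, while `widenedFreshId` survives. Whether the real optimizer rule behaves this way in every case is exactly what the comment proposes to check.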
[GitHub] [spark] cloud-fan commented on a change in pull request #29485: [SPARK-32638][SQL] Corrects references when adding aliases in WidenSetOperationTypes
cloud-fan commented on a change in pull request #29485: URL: https://github.com/apache/spark/pull/29485#discussion_r475348817

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ##
@@ -328,27 +328,46 @@ object TypeCoercion {
+      // Re-maps existing references to the new ones (exprId and dataType)
+      // for aliases added when widening columns' data types.

Review comment: Yes, like re-alias with exprId=1

Just did a quick search, rule `TimeWindowing`, `Aggregation` did it. AFAIK it's common when need to change the plan in the middle and don't want to affect the parent nodes.
[GitHub] [spark] maropu commented on a change in pull request #29485: [SPARK-32638][SQL] Corrects references when adding aliases in WidenSetOperationTypes
maropu commented on a change in pull request #29485: URL: https://github.com/apache/spark/pull/29485#discussion_r475347837

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ##
@@ -328,27 +328,46 @@ object TypeCoercion {
+      // Re-maps existing references to the new ones (exprId and dataType)
+      // for aliases added when widening columns' data types.

Review comment: You meant re-alias with exprId=1 in the example above like this?

```
org.apache.spark.sql.AnalysisException: Resolved attribute(s) v#1 missing from v#3 in operator !Project [v#1]. Attribute(s) with the same name appear in the operation: v. Please check if the right attribute(s) are used.;;
!Project [v#1]  <-- the reference got missing
+- SubqueryAlias t
   +- Union
      :- Project [cast(v#1 as decimal(11,0)) AS v#3]  <- re-alias with exprId=#1 ?!
      :  +- Project [v#1]
      :     +- SubqueryAlias t3
      :        +- SubqueryAlias tbl
      :           +- LocalRelation [v#1]
      +- Project [v#2]
         +- Project [CheckOverflow((promote_precision(cast(v#1 as decimal(11,0))) + promote_precision(cast(v#1 as decimal(11,0, DecimalType(11,0), true) AS v#2]
            +- SubqueryAlias t3
               +- SubqueryAlias tbl
                  +- LocalRelation [v#1]
```
[GitHub] [spark] viirya commented on a change in pull request #29485: [SPARK-32638][SQL] Corrects references when adding aliases in WidenSetOperationTypes
viirya commented on a change in pull request #29485: URL: https://github.com/apache/spark/pull/29485#discussion_r475347688

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ##
@@ -328,27 +328,46 @@ object TypeCoercion {
+      // Re-maps existing references to the new ones (exprId and dataType)
+      // for aliases added when widening columns' data types.

Review comment: I thought about it too. But I'm not sure if duplicate exprId is okay. If this is common way, it sounds simple and safe.
[GitHub] [spark] tanelk commented on pull request #29515: [WIP][SPARK-32688][SQL][TESTS] Add special values to LiteralGenerator for float and double
tanelk commented on pull request #29515: URL: https://github.com/apache/spark/pull/29515#issuecomment-678908920 There is a `org.apache.spark.sql.RandomDataGenerator`, that does pretty much the same thing as the `LiteralGenerator`. Perhaps they should be unified?
[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source
cchighman commented on pull request #28841: URL: https://github.com/apache/spark/pull/28841#issuecomment-678908616 I intend to update the PR based on comments, I'll try to swing around to it this evening.
[GitHub] [spark] cchighman commented on a change in pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source
cchighman commented on a change in pull request #28841: URL: https://github.com/apache/spark/pull/28841#discussion_r475347166

## File path: docs/sql-data-sources-generic-options.md ##
@@ -119,3 +119,48 @@ To load all files recursively, you can use:
 {% include_example recursive_file_lookup r/RSparkSQLExample.R %}
+
+### Modification Time Path Filters
+`modifiedBefore` and `modifiedAfter` are options that can be
+applied together or separately in order to achieve greater
+granularity over which files may load during a Spark batch query.
+
+When the `timeZone` option is present, modified timestamps will be
+interpreted according to the specified zone. When a timezone option
+is not provided, modified timestamps will be interpreted according
+to the default zone specified within the Spark configuration. Without
+any timezone configuration, modified timestamps are interpreted as UTC.
+
+`modifiedBefore` will only allow files having last modified
+timestamps occurring before the specified time to load. For example,
+when`modifiedBefore` has the timestamp `2020-06-01T12:00:00` applied,
+all files modified after that time will not be considered when loading
+from a file data source.
+
+`modifiedAfter` only allows files having last modified timestamps
+occurring after the specified timestamp. For example, when`modifiedAfter`
+has the timestamp `2020-06-01T12:00:00` applied, only files modified after
+this time will be eligible when loading from a file data source. When both
+`modifiedBefore` and `modifiedAfter` are specified together, files having
+last modified timestamps within the resulting time range are the only files
+allowed to load.

Review comment: Will update

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PathFilterSuite.scala ##
@@ -0,0 +1,501 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.File
+import java.time.{LocalDateTime, ZoneOffset}
+import java.time.format.DateTimeFormatter
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.sql.{AnalysisException, QueryTest, Row}
+import org.apache.spark.sql.catalyst.util.{stringToFile, CaseInsensitiveMap, DateTimeUtils}
+import org.apache.spark.sql.test.SharedSparkSession
+
+class PathFilterSuite extends QueryTest with SharedSparkSession {
+  import testImplicits._
+
+  test("SPARK-31962: when modifiedAfter specified with a past date") {
+    withTempDir { dir =>
+      val path = new Path(dir.getCanonicalPath)
+      val file = new File(dir, "file1.csv")
+      stringToFile(file, "text")
+      file.setLastModified(DateTimeUtils.currentTimestamp())
+      val df = spark.read
+        .option("modifiedAfter", "2019-05-10T01:11:00")
+        .format("csv")
+        .load(path.toString)
+      assert(df.count() == 1)
+    }
+  }
+
+  test("SPARK-31962: when modifiedBefore specified with a future date") {
+    withTempDir { dir =>
+      val path = new Path(dir.getCanonicalPath)
+      val file = new File(dir, "file1.csv")
+      stringToFile(file, "text")
+      val df = spark.read
+        .option("modifiedBefore", "2090-05-10T01:11:00")
+        .format("csv")
+        .load(path.toString)
+      assert(df.count() == 1)
+    }
+  }
+
+  test("SPARK-31962: when modifiedBefore specified with a past date") {
+    withTempDir { dir =>
+      val path = new Path(dir.getCanonicalPath)
+      val file = new File(dir, "file1.csv")
+      stringToFile(file, "text")
+      file.setLastModified(DateTimeUtils.currentTimestamp())
+      val msg = intercept[AnalysisException] {
+        spark.read
+          .option("modifiedBefore", "1984-05-01T01:00:00")
+          .format("csv")
+          .load(path.toString)
+      }.getMessage
+      assert(msg.contains("Unable to infer schema for CSV"))
+    }
+  }
+
+  test("SPARK-31962: when modifiedAfter specified with a past date, multiple files, one valid") {
+    withTempDir { dir =>
+      val path = new Path(dir.getCanonicalPath)
+      val file1 = new File(dir, "file1.csv")
+      val file2 = new File(dir, "file2.csv")
+      stringToFile(file1, "text")
+      stringToFile(file2, "text")
+
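The filtering semantics described in the documentation hunk of this message can be sketched as a plain predicate over a file's last-modified time. This is a minimal illustration and not the PR's implementation: `ModifiedTimeFilterSketch`, `toEpochMillis` and `accepts` are hypothetical names, the bounds are assumed to be strict, and the zone is assumed to default to UTC when nothing else is configured.

```scala
import java.time.{LocalDateTime, ZoneId, ZoneOffset}

// Hypothetical helper, not part of Spark: decides whether a file's
// last-modified time passes the optional modifiedAfter / modifiedBefore bounds.
object ModifiedTimeFilterSketch {
  // Interprets a timestamp such as "2020-06-01T12:00:00" in the given zone.
  def toEpochMillis(timestamp: String, zone: ZoneId): Long =
    LocalDateTime.parse(timestamp).atZone(zone).toInstant.toEpochMilli

  def accepts(
      fileModifiedMillis: Long,
      modifiedAfter: Option[String],
      modifiedBefore: Option[String],
      zone: ZoneId = ZoneOffset.UTC): Boolean = {
    // An absent bound accepts every file; strict comparison is an assumption.
    val afterOk = modifiedAfter.forall(ts => fileModifiedMillis > toEpochMillis(ts, zone))
    val beforeOk = modifiedBefore.forall(ts => fileModifiedMillis < toEpochMillis(ts, zone))
    afterOk && beforeOk
  }
}
```

When both bounds are given, only files whose timestamp falls inside the resulting range are accepted, which matches the last sentence of the quoted documentation. In Spark itself the options are passed as in the quoted tests, for example `spark.read.option("modifiedAfter", "2019-05-10T01:11:00")`.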
[GitHub] [spark] AmplabJenkins commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source
AmplabJenkins commented on pull request #28841: URL: https://github.com/apache/spark/pull/28841#issuecomment-678908136 Can one of the admins verify this patch?
[GitHub] [spark] maropu commented on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
maropu commented on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678907495 Ah, I see.
[GitHub] [spark] cloud-fan commented on a change in pull request #29485: [SPARK-32638][SQL] Corrects references when adding aliases in WidenSetOperationTypes
cloud-fan commented on a change in pull request #29485: URL: https://github.com/apache/spark/pull/29485#discussion_r475346278

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ##
@@ -328,27 +328,46 @@ object TypeCoercion {
+      // Re-maps existing references to the new ones (exprId and dataType)
+      // for aliases added when widening columns' data types.

Review comment: Another common way to solve this issue is to create an `Alias` with the existing exprId, so that we don't need to rewrite the parent nodes. I think it's safer than rewriting the parent nodes. We rewrite parent nodes in `ResolveReferences.dedupRight`, we hit bugs and end up with a complicated solution.
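The trade-off in the review comment above can be sketched with a toy model. Everything here is an assumption made for illustration: `AttrRef`, `CastAlias`, `widenWithNewId`, `widenReusingId` and `resolves` are hypothetical names rather than Catalyst classes, and reference resolution is reduced to a lookup by exprId.

```scala
// Toy model only; these are not Catalyst classes.
object ReuseExprIdSketch {
  final case class AttrRef(name: String, exprId: Long)

  // Stands in for an alias around a cast: cast(source as castTo) AS name#exprId.
  final case class CastAlias(source: AttrRef, castTo: String, name: String, exprId: Long) {
    def toAttribute: AttrRef = AttrRef(name, exprId)
  }

  // Strategy 1: the alias gets a new exprId, so parent references must be rewritten.
  def widenWithNewId(a: AttrRef, castTo: String, newId: Long): CastAlias =
    CastAlias(a, castTo, a.name, newId)

  // Strategy 2: the alias keeps the exprId of the attribute it replaces.
  def widenReusingId(a: AttrRef, castTo: String): CastAlias =
    CastAlias(a, castTo, a.name, a.exprId)

  // A parent reference resolves when some child output carries its exprId.
  def resolves(parentRef: AttrRef, childOutput: Seq[AttrRef]): Boolean =
    childOutput.exists(_.exprId == parentRef.exprId)
}
```

With a parent that refers to `v#1`, `widenWithNewId(v1, "decimal(11,0)", 3L)` produces an output `v#3` that no longer resolves the parent's reference unless the parent is rewritten, whereas `widenReusingId(v1, "decimal(11,0)")` keeps the parent valid without touching it.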
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
AmplabJenkins removed a comment on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678907031
[GitHub] [spark] viirya commented on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
viirya commented on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678906923 @maropu I think #29406 was only merged to master, so we don't need to backport this.
[GitHub] [spark] AmplabJenkins commented on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
AmplabJenkins commented on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678907031
[GitHub] [spark] SparkQA removed a comment on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
SparkQA removed a comment on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678844753 **[Test build #127819 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127819/testReport)** for PR 29526 at commit [`d16f654`](https://github.com/apache/spark/commit/d16f65482820746299868a3572a42129d7e3).
[GitHub] [spark] SparkQA commented on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
SparkQA commented on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678906358 **[Test build #127819 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127819/testReport)** for PR 29526 at commit [`d16f654`](https://github.com/apache/spark/commit/d16f65482820746299868a3572a42129d7e3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] maropu commented on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
maropu commented on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678906274 Merged to master. @viirya Looks like conflicts with branch-3.0. Could you backport it?
[GitHub] [spark] maropu edited a comment on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
maropu edited a comment on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678906274 Merged to master. @viirya Looks like conflicts with branch-3.0. Could you backport it?
[GitHub] [spark] viirya commented on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
viirya commented on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678906138 Thanks all!
[GitHub] [spark] maropu closed pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
maropu closed pull request #29526: URL: https://github.com/apache/spark/pull/29526
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
AmplabJenkins removed a comment on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678905843
[GitHub] [spark] AmplabJenkins commented on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
AmplabJenkins commented on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678905843
[GitHub] [spark] SparkQA removed a comment on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
SparkQA removed a comment on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678845961 **[Test build #127820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127820/testReport)** for PR 29526 at commit [`b37f694`](https://github.com/apache/spark/commit/b37f6949f1f7c4c6d2264559402a963eb077990d).
[GitHub] [spark] SparkQA commented on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
SparkQA commented on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678905088 **[Test build #127820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127820/testReport)** for PR 29526 at commit [`b37f694`](https://github.com/apache/spark/commit/b37f6949f1f7c4c6d2264559402a963eb077990d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
AmplabJenkins removed a comment on pull request #29421: URL: https://github.com/apache/spark/pull/29421#issuecomment-678903482 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127823/ Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #29414: [SPARK-32106][SQL] Implement script transform in sql/core
SparkQA commented on pull request #29414: URL: https://github.com/apache/spark/pull/29414#issuecomment-678903820 **[Test build #127829 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127829/testReport)** for PR 29414 at commit [`dabae9b`](https://github.com/apache/spark/commit/dabae9b38038c06f8b3f1e9a7b6b5be04150b667).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
AmplabJenkins removed a comment on pull request #29421: URL: https://github.com/apache/spark/pull/29421#issuecomment-678903475 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
SparkQA removed a comment on pull request #29421: URL: https://github.com/apache/spark/pull/29421#issuecomment-678859868 **[Test build #127823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127823/testReport)** for PR 29421 at commit [`5f03222`](https://github.com/apache/spark/commit/5f032229ca2c457753622e21e22d92848de24fa6).
[GitHub] [spark] AmplabJenkins commented on pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
AmplabJenkins commented on pull request #29421: URL: https://github.com/apache/spark/pull/29421#issuecomment-678903475
[GitHub] [spark] srowen commented on pull request #29501: [SPARK-32676][3.0][ML] Fix double caching in KMeans/BiKMeans
srowen commented on pull request #29501: URL: https://github.com/apache/spark/pull/29501#issuecomment-678903211 Oh yeah, to backport, you would need to check out branch-3.0, cherry-pick the commit, and then push straight to branch-3.0. It's not hard, it just doesn't use the script (I don't know why it doesn't work anymore). It just takes a little care to make sure you push what you mean and where!
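The manual backport described above can be tried out safely in a throwaway repository first. The following is only a minimal sketch, not the project's merge script: the temporary repository, the file names, the placeholder identity, and the `[SPARK-XXXXX]` commit subject are all hypothetical stand-ins, and the final push is left as a comment because the remote name depends on your own clone.

```shell
#!/bin/sh
# Hypothetical, self-contained demo: builds a scratch repository so the
# backport steps can be rehearsed without touching apache/spark.
set -eu

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
git symbolic-ref HEAD refs/heads/master      # make the first branch "master"
git config user.email "dev@example.com"      # placeholder identity
git config user.name "Example Dev"

# Initial commit shared by master and the release branch.
echo "base" > file.txt
git add file.txt
git commit -q -m "Initial commit"
git branch branch-3.0

# A fix lands on master first (placeholder subject line).
echo "fix" > fix.txt
git add fix.txt
git commit -q -m "[SPARK-XXXXX] Example fix"
fix_sha=$(git rev-parse HEAD)

# Backport: check out the release branch and cherry-pick the fix.
git checkout -q branch-3.0
git cherry-pick "$fix_sha" >/dev/null

# Inspect what would be pushed before pushing anything. In a real clone the
# last step would be: git push <remote> branch-3.0
# where <remote> is whatever name your Apache remote has.
backported_subject=$(git log -1 --format=%s)
echo "$backported_subject"
```

Checking `git log` on the release branch before the push is the "little care" step: it confirms that only the intended commit sits on top of branch-3.0.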
[GitHub] [spark] SparkQA commented on pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
SparkQA commented on pull request #29421: URL: https://github.com/apache/spark/pull/29421#issuecomment-678903206 **[Test build #127823 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127823/testReport)** for PR 29421 at commit [`5f03222`](https://github.com/apache/spark/commit/5f032229ca2c457753622e21e22d92848de24fa6). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29414: [SPARK-32106][SQL] Implement script transform in sql/core
AngersZhuuuu commented on a change in pull request #29414: URL: https://github.com/apache/spark/pull/29414#discussion_r475341428
## File path: sql/core/src/test/resources/sql-tests/results/transform.sql.out ##
@@ -0,0 +1,224 @@
+-- Automatically generated by SQLQueryTestSuite
+-- Number of queries: 15
+
+
+-- !query
+CREATE OR REPLACE TEMPORARY VIEW t AS SELECT * FROM VALUES
+('1', true, unhex('537061726B2053514C'), tinyint(1), 1, smallint(100), bigint(1), float(1.0), 1.0, Decimal(1.0), timestamp('1997-01-02'), date('2000-04-01')),
+('2', false, unhex('537061726B2053514C'), tinyint(2), 2, smallint(200), bigint(2), float(2.0), 2.0, Decimal(2.0), timestamp('1997-01-02 03:04:05'), date('2000-04-02')),
+('3', true, unhex('537061726B2053514C'), tinyint(3), 3, smallint(300), bigint(3), float(3.0), 3.0, Decimal(3.0), timestamp('1997-02-10 17:32:01-08'), date('2000-04-03'))
+AS t(a, b, c, d, e, f, g, h, i, j, k, l)
+-- !query schema
+struct<>
+-- !query output
+
+
+
+-- !query
+SELECT TRANSFORM(a)
+USING 'cat' AS (a)
+FROM t
+-- !query schema
+struct
+-- !query output
+1
+2
+3
+
+
+-- !query
+SELECT TRANSFORM(a)
+USING 'some_non_existent_command' AS (a)
+FROM t
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkException
+Subprocess exited with status 127. Error: /bin/bash: some_non_existent_command: command not found
+
+
+-- !query
+SELECT TRANSFORM(a)
+USING 'python some_non_existent_file' AS (a)
+FROM t
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkException
+Subprocess exited with status 2. Error: python: can't open file 'some_non_existent_file': [Errno 2] No such file or directory
+
+
+-- !query
+SELECT a, b, decode(c, 'UTF-8'), d, e, f, g, h, i, j, k, l FROM (
+ SELECT TRANSFORM(a, b, c, d, e, f, g, h, i, j, k, l)
+ USING 'cat' AS (
+a string,
+b boolean,
+c binary,
+d tinyint,
+e int,
+f smallint,
+g long,
+h float,
+i double,
+j decimal(38, 18),
+k timestamp,
+l date)
+ FROM t
+) tmp
+-- !query schema
+struct
+-- !query output
+1 trueSpark SQL 1 1 100 1 1.0 1.0 1.001997-01-02 00:00:00 2000-04-01
+2 false Spark SQL 2 2 200 2 2.0 2.0 2.001997-01-02 03:04:05 2000-04-02
+3 trueSpark SQL 3 3 300 3 3.0 3.0 3.001997-02-10 17:32:01 2000-04-03
+
+
+-- !query
+SELECT a, b, decode(c, 'UTF-8'), d, e, f, g, h, i, j, k, l FROM (
+ SELECT TRANSFORM(a, b, c, d, e, f, g, h, i, j, k, l)
+ USING 'cat' AS (
+a string,
+b string,
+c string,
+d string,
+e string,
+f string,
+g string,
+h string,
+i string,
+j string,
+k string,
+l string)
+ FROM t
+) tmp
+-- !query schema
+struct
+-- !query output
+1 trueSpark SQL 1 1 100 1 1.0 1.0 1 1997-01-02 00:00:00 2000-04-01
+2 false Spark SQL 2 2 200 2 2.0 2.0 2 1997-01-02 03:04:05 2000-04-02
+3 trueSpark SQL 3 3 300 3 3.0 3.0 3 1997-02-10 17:32:01 2000-04-03
+
+
+-- !query
+SELECT TRANSFORM(a)
+USING 'cat'
+FROM t
+-- !query schema
+struct<>
+-- !query output
+java.lang.ArrayIndexOutOfBoundsException
+1
+
+
+-- !query
+SELECT TRANSFORM(a, b)
+USING 'cat'
+FROM t
+-- !query schema
+struct
+-- !query output
+1 true
+2 false
+3 true
+
+
+-- !query
+SELECT TRANSFORM(a, b, c)
+USING 'cat'
+FROM t
+-- !query schema
+struct
+-- !query output
+1 true
+2 false
+3 true
+
+
+-- !query
+SELECT TRANSFORM(a, b, c, d, e, f, g, h, i)
+USING 'cat' AS (a int, b short, c long, d byte, e float, f double, g decimal(38, 18), h date, i timestamp)
+FROM VALUES
+('a','','1231a','a','213.21a','213.21a','0a.21d','2000-04-01123','1997-0102 00:00:') tmp(a, b, c, d, e, f, g, h, i)
+-- !query schema
+struct
+-- !query output
+NULL NULLNULLNULLNULLNULLNULLNULLNULL
+
+
+-- !query
+SELECT TRANSFORM(b, max(a), sum(f))
+USING 'cat' AS (a, b)
+FROM t
+GROUP BY b
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.catalyst.parser.ParseException
+
+mismatched input 'GROUP' expecting {, ';'}(line 4, pos 0)
+
+== SQL ==
+SELECT TRANSFORM(b, max(a), sum(f))
+USING 'cat' AS (a, b)
+FROM t
+GROUP BY b
+^^^
+
+
+-- !query
+MAP a, b USING 'cat' AS (a, b) FROM t
+-- !query schema
+struct
+-- !query output
+1 true
+2 false
+3 true
+
+
+-- !query
+REDUCE a, b USING 'cat' AS (a, b) FROM t
+-- !query schema
+struct
+-- !query output
+1 true
+2 false
+3 true
+
+
+-- !query
+SELECT TRANSFORM(a, b, c, null)
+ ROW FORMAT DELIMITED
+ FIELDS TERMINATED BY '|'
+ LINES TERMINATED BY '\n'
+ NULL DEFINED AS 'NULL'
+USING 'cat' AS (a, b, c, d)
Review comment:
> Also, could you add test cases
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29414: [SPARK-32106][SQL] Implement script transform in sql/core
AmplabJenkins removed a comment on pull request #29414: URL: https://github.com/apache/spark/pull/29414#issuecomment-678902339
[GitHub] [spark] AmplabJenkins commented on pull request #29414: [SPARK-32106][SQL] Implement script transform in sql/core
AmplabJenkins commented on pull request #29414: URL: https://github.com/apache/spark/pull/29414#issuecomment-678902339
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29516: [WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
AmplabJenkins removed a comment on pull request #29516: URL: https://github.com/apache/spark/pull/29516#issuecomment-678900712 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127824/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29516: [WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
AmplabJenkins removed a comment on pull request #29516: URL: https://github.com/apache/spark/pull/29516#issuecomment-678900708 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #29516: [WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
SparkQA removed a comment on pull request #29516: URL: https://github.com/apache/spark/pull/29516#issuecomment-678870379 **[Test build #127824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127824/testReport)** for PR 29516 at commit [`f3d14c6`](https://github.com/apache/spark/commit/f3d14c61550877a6d3b2df15954fee30c8546fa5).
[GitHub] [spark] AmplabJenkins commented on pull request #29516: [WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
AmplabJenkins commented on pull request #29516: URL: https://github.com/apache/spark/pull/29516#issuecomment-678900708
[GitHub] [spark] huaxingao commented on pull request #29501: [SPARK-32676][3.0][ML] Fix double caching in KMeans/BiKMeans
huaxingao commented on pull request #29501: URL: https://github.com/apache/spark/pull/29501#issuecomment-678899420 I don't know how to merge this one. I got the following message: ``` Pull request 29501 is not mergeable in its current form. Continue? (experts only!) (y/n): ``` I am not sure if I should continue. Do I need to reopen the PR first?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29513: [SPARK-32646][SQL][3.0][test-hadoop2.7][test-hive1.2] ORC predicate pushdown should work with case-insensitive analysis
AmplabJenkins removed a comment on pull request #29513: URL: https://github.com/apache/spark/pull/29513#issuecomment-678897328
[GitHub] [spark] AmplabJenkins commented on pull request #29513: [SPARK-32646][SQL][3.0][test-hadoop2.7][test-hive1.2] ORC predicate pushdown should work with case-insensitive analysis
AmplabJenkins commented on pull request #29513: URL: https://github.com/apache/spark/pull/29513#issuecomment-678897328
[GitHub] [spark] SparkQA commented on pull request #29513: [SPARK-32646][SQL][3.0][test-hadoop2.7][test-hive1.2] ORC predicate pushdown should work with case-insensitive analysis
SparkQA commented on pull request #29513: URL: https://github.com/apache/spark/pull/29513#issuecomment-678897095 **[Test build #127828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127828/testReport)** for PR 29513 at commit [`a19e523`](https://github.com/apache/spark/commit/a19e523a02b7ef39213aabb130d554839a50beeb).
[GitHub] [spark] viirya commented on pull request #29513: [SPARK-32646][SQL][3.0][test-hadoop2.7][test-hive1.2] ORC predicate pushdown should work with case-insensitive analysis
viirya commented on pull request #29513: URL: https://github.com/apache/spark/pull/29513#issuecomment-678896440 retest this please
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29527: [SPARK-32664] fixes log level
AmplabJenkins removed a comment on pull request #29527: URL: https://github.com/apache/spark/pull/29527#issuecomment-678895387 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #29527: [SPARK-32664] fixes log level
AmplabJenkins commented on pull request #29527: URL: https://github.com/apache/spark/pull/29527#issuecomment-678895672 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #29527: [SPARK-32664] fixes log level
AmplabJenkins commented on pull request #29527: URL: https://github.com/apache/spark/pull/29527#issuecomment-678895387 Can one of the admins verify this patch?
[GitHub] [spark] srowen commented on pull request #29501: [SPARK-32676][3.0][ML] Fix double caching in KMeans/BiKMeans
srowen commented on pull request #29501: URL: https://github.com/apache/spark/pull/29501#issuecomment-678895333 Go ahead yes
[GitHub] [spark] dmoore62 opened a new pull request #29527: [SPARK-32664] fixes log level
dmoore62 opened a new pull request #29527: URL: https://github.com/apache/spark/pull/29527 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] huaxingao commented on pull request #29501: [SPARK-32676][3.0][ML] Fix double caching in KMeans/BiKMeans
huaxingao commented on pull request #29501: URL: https://github.com/apache/spark/pull/29501#issuecomment-678893930 @srowen Shall I merge this into 3.0?
[GitHub] [spark] huaxingao commented on pull request #29524: [SPARK-32092][ML][PySpark][3.0] Removed foldCol related code
huaxingao commented on pull request #29524: URL: https://github.com/apache/spark/pull/29524#issuecomment-678893290 Merged to 3.0. Thank you all!
[GitHub] [spark] huaxingao closed pull request #29524: [SPARK-32092][ML][PySpark][3.0] Removed foldCol related code
huaxingao closed pull request #29524: URL: https://github.com/apache/spark/pull/29524
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29228: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
AmplabJenkins removed a comment on pull request #29228: URL: https://github.com/apache/spark/pull/29228#issuecomment-67647
[GitHub] [spark] AmplabJenkins commented on pull request #29509: [SPARK-31608][CORE][WEBUI][TEST] Add test suites for HybridStore and HistoryServerMemoryManager
AmplabJenkins commented on pull request #29509: URL: https://github.com/apache/spark/pull/29509#issuecomment-67621
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29509: [SPARK-31608][CORE][WEBUI][TEST] Add test suites for HybridStore and HistoryServerMemoryManager
AmplabJenkins removed a comment on pull request #29509: URL: https://github.com/apache/spark/pull/29509#issuecomment-67621
[GitHub] [spark] AmplabJenkins commented on pull request #29228: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
AmplabJenkins commented on pull request #29228: URL: https://github.com/apache/spark/pull/29228#issuecomment-67647
[GitHub] [spark] SparkQA commented on pull request #29228: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
SparkQA commented on pull request #29228: URL: https://github.com/apache/spark/pull/29228#issuecomment-67372 **[Test build #127827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127827/testReport)** for PR 29228 at commit [`86dc8f8`](https://github.com/apache/spark/commit/86dc8f81702f6694bd17d4578d81133ce0731ac5).
[GitHub] [spark] SparkQA commented on pull request #29509: [SPARK-31608][CORE][WEBUI][TEST] Add test suites for HybridStore and HistoryServerMemoryManager
SparkQA commented on pull request #29509: URL: https://github.com/apache/spark/pull/29509#issuecomment-67349 **[Test build #127826 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127826/testReport)** for PR 29509 at commit [`1b105e3`](https://github.com/apache/spark/commit/1b105e3a9ca090fd134b8eebce5ed714d8567a1e).
[GitHub] [spark] huaxingao commented on pull request #29355: [SPARK-32552][SQL][DOCS]Complete the documentation for Table-valued Function
huaxingao commented on pull request #29355: URL: https://github.com/apache/spark/pull/29355#issuecomment-678887751 Thanks a lot! @maropu
[GitHub] [spark] baohe-zhang commented on a change in pull request #29509: [SPARK-31608][CORE][WEBUI][TEST] Add test suites for HybridStore and HistoryServerMemoryManager
baohe-zhang commented on a change in pull request #29509: URL: https://github.com/apache/spark/pull/29509#discussion_r475325934 ## File path: core/src/test/scala/org/apache/spark/deploy/history/HybridStoreSuite.scala ## @@ -0,0 +1,230 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.deploy.history + +import java.io.File +import java.util.NoSuchElementException +import java.util.concurrent.LinkedBlockingQueue + +import org.apache.commons.io.FileUtils +import org.scalatest.BeforeAndAfter + +import org.apache.spark.SparkFunSuite +import org.apache.spark.status.KVUtils._ +import org.apache.spark.util.kvstore._ + +class HybridStoreSuite extends SparkFunSuite with BeforeAndAfter { + + private var db: LevelDB = _ + private var dbpath: File = _ + + before { +dbpath = File.createTempFile("test.", ".ldb") +dbpath.delete() +db = new LevelDB(dbpath, new KVStoreScalaSerializer()) + } + + after { +if (db != null) { + db.close() +} +if (dbpath != null) { + FileUtils.deleteQuietly(dbpath) +} + } + + test("test multiple objects write read delete") { +val store = createHybridStore() + +val t1 = createCustomType1(1) +val t2 = createCustomType1(2) + +intercept[NoSuchElementException] { + store.read(t1.getClass(), t1.key) +} + +store.write(t1) +store.write(t2) +store.delete(t2.getClass(), t2.key) + +Seq(false, true).foreach { switch => + if (switch) switchHybridStore(store) + + intercept[NoSuchElementException] { + store.read(t2.getClass(), t2.key) + } + assert(store.read(t1.getClass(), t1.key) === t1) + assert(store.count(t1.getClass()) === 1L) +} + } + + test("test metadata") { +val store = createHybridStore() +assert(store.getMetadata(classOf[CustomType1]) === null) + +val t1 = createCustomType1(1) +store.setMetadata(t1) +assert(store.getMetadata(classOf[CustomType1]) === t1) + +// Switch to LevelDB and set a new metadata +switchHybridStore(store) + +val t2 = createCustomType1(2) +store.setMetadata(t2) +assert(store.getMetadata(classOf[CustomType1]) === t2) + } + + test("test update") { +val store = createHybridStore() +val t = createCustomType1(1) + +store.write(t) +t.name = "name2" +store.write(t) + +Seq(false, true).foreach { switch => + if (switch) switchHybridStore(store) + + assert(store.count(t.getClass()) === 1L) + 
assert(store.read(t.getClass(), t.key) === t) +} + } + + test("test basic iteration") { +val store = createHybridStore() + +val t1 = createCustomType1(1) +store.write(t1) +val t2 = createCustomType1(2) +store.write(t2) + +Seq(false, true).foreach { switch => + if (switch) switchHybridStore(store) + + assert(store.view(t1.getClass()).iterator().next().id === t1.id) + assert(store.view(t1.getClass()).skip(1).iterator().next().id === t2.id) + assert(store.view(t1.getClass()).skip(1).max(1).iterator().next().id === t2.id) + assert(store.view(t1.getClass()).first(t1.key).max(1).iterator().next().id === t1.id) + assert(store.view(t1.getClass()).first(t2.key).max(1).iterator().next().id === t2.id) +} + } + + test("test delete after switch") { +val store = createHybridStore() +val t = createCustomType1(1) +store.write(t) +switchHybridStore(store) +intercept[IllegalStateException] { + store.delete(t.getClass(), t.key) +} + } + + test("test klassMap") { +val store = createHybridStore() +val t1 = createCustomType1(1) +store.write(t1) +assert(store.klassMap.size === 1) +val t2 = new CustomType2("key2") +store.write(t2) +assert(store.klassMap.size === 2) + +switchHybridStore(store) +val t3 = new CustomType3("key3") +store.write(t3) +// Cannot put new klass to klassMap after the switching starts +assert(store.klassMap.size === 2) + } + + private def createHybridStore(): HybridStore = { +val store = new HybridStore() +store.setLevelDB(db) +store + } + + private def createCustomType1(i: Int): CustomType1 = { +new CustomType1("key" + i, "id" + i,
[GitHub] [spark] baohe-zhang commented on a change in pull request #29509: [SPARK-31608][CORE][WEBUI][TEST] Add test suites for HybridStore and HistoryServerMemoryManager
baohe-zhang commented on a change in pull request #29509: URL: https://github.com/apache/spark/pull/29509#discussion_r475325165 ## File path: core/src/test/scala/org/apache/spark/deploy/history/HybridStoreSuite.scala ## @@ -0,0 +1,230 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.deploy.history + +import java.io.File +import java.util.NoSuchElementException +import java.util.concurrent.LinkedBlockingQueue + +import org.apache.commons.io.FileUtils +import org.scalatest.BeforeAndAfter + +import org.apache.spark.SparkFunSuite +import org.apache.spark.status.KVUtils._ +import org.apache.spark.util.kvstore._ + +class HybridStoreSuite extends SparkFunSuite with BeforeAndAfter { + + private var db: LevelDB = _ + private var dbpath: File = _ + + before { +dbpath = File.createTempFile("test.", ".ldb") +dbpath.delete() +db = new LevelDB(dbpath, new KVStoreScalaSerializer()) + } + + after { +if (db != null) { + db.close() +} +if (dbpath != null) { + FileUtils.deleteQuietly(dbpath) +} + } + + test("test multiple objects write read delete") { +val store = createHybridStore() + +val t1 = createCustomType1(1) +val t2 = createCustomType1(2) + +intercept[NoSuchElementException] { + store.read(t1.getClass(), t1.key) +} + +store.write(t1) +store.write(t2) +store.delete(t2.getClass(), t2.key) + +Seq(false, true).foreach { switch => + if (switch) switchHybridStore(store) + + intercept[NoSuchElementException] { + store.read(t2.getClass(), t2.key) + } + assert(store.read(t1.getClass(), t1.key) === t1) + assert(store.count(t1.getClass()) === 1L) +} + } + + test("test metadata") { +val store = createHybridStore() +assert(store.getMetadata(classOf[CustomType1]) === null) + +val t1 = createCustomType1(1) +store.setMetadata(t1) +assert(store.getMetadata(classOf[CustomType1]) === t1) + +// Switch to LevelDB and set a new metadata +switchHybridStore(store) + +val t2 = createCustomType1(2) +store.setMetadata(t2) +assert(store.getMetadata(classOf[CustomType1]) === t2) + } + + test("test update") { +val store = createHybridStore() +val t = createCustomType1(1) + +store.write(t) +t.name = "name2" +store.write(t) + +Seq(false, true).foreach { switch => + if (switch) switchHybridStore(store) + + assert(store.count(t.getClass()) === 1L) + 
assert(store.read(t.getClass(), t.key) === t) +} + } + + test("test basic iteration") { +val store = createHybridStore() + +val t1 = createCustomType1(1) +store.write(t1) +val t2 = createCustomType1(2) +store.write(t2) + +Seq(false, true).foreach { switch => + if (switch) switchHybridStore(store) + + assert(store.view(t1.getClass()).iterator().next().id === t1.id) + assert(store.view(t1.getClass()).skip(1).iterator().next().id === t2.id) + assert(store.view(t1.getClass()).skip(1).max(1).iterator().next().id === t2.id) + assert(store.view(t1.getClass()).first(t1.key).max(1).iterator().next().id === t1.id) + assert(store.view(t1.getClass()).first(t2.key).max(1).iterator().next().id === t2.id) +} + } + + test("test delete after switch") { +val store = createHybridStore() +val t = createCustomType1(1) +store.write(t) +switchHybridStore(store) +intercept[IllegalStateException] { + store.delete(t.getClass(), t.key) +} + } + + test("test klassMap") { +val store = createHybridStore() +val t1 = createCustomType1(1) +store.write(t1) +assert(store.klassMap.size === 1) +val t2 = new CustomType2("key2") +store.write(t2) +assert(store.klassMap.size === 2) + +switchHybridStore(store) +val t3 = new CustomType3("key3") +store.write(t3) +// Cannot put new klass to klassMap after the switching starts +assert(store.klassMap.size === 2) + } + + private def createHybridStore(): HybridStore = { +val store = new HybridStore() +store.setLevelDB(db) +store + } + + private def createCustomType1(i: Int): CustomType1 = { +new CustomType1("key" + i, "id" + i,
[GitHub] [spark] baohe-zhang commented on a change in pull request #29509: [SPARK-31608][CORE][WEBUI][TEST] Add test suites for HybridStore and HistoryServerMemoryManager
baohe-zhang commented on a change in pull request #29509: URL: https://github.com/apache/spark/pull/29509#discussion_r475325041 ## File path: core/src/test/scala/org/apache/spark/deploy/history/HybridStoreSuite.scala ## @@ -0,0 +1,230 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.deploy.history + +import java.io.File +import java.util.NoSuchElementException +import java.util.concurrent.LinkedBlockingQueue + +import org.apache.commons.io.FileUtils +import org.scalatest.BeforeAndAfter + +import org.apache.spark.SparkFunSuite +import org.apache.spark.status.KVUtils._ +import org.apache.spark.util.kvstore._ + +class HybridStoreSuite extends SparkFunSuite with BeforeAndAfter { + + private var db: LevelDB = _ + private var dbpath: File = _ + + before { +dbpath = File.createTempFile("test.", ".ldb") +dbpath.delete() +db = new LevelDB(dbpath, new KVStoreScalaSerializer()) + } + + after { +if (db != null) { + db.close() +} +if (dbpath != null) { + FileUtils.deleteQuietly(dbpath) +} + } + + test("test multiple objects write read delete") { +val store = createHybridStore() + +val t1 = createCustomType1(1) +val t2 = createCustomType1(2) + +intercept[NoSuchElementException] { + store.read(t1.getClass(), t1.key) +} + +store.write(t1) +store.write(t2) +store.delete(t2.getClass(), t2.key) + +Seq(false, true).foreach { switch => + if (switch) switchHybridStore(store) + + intercept[NoSuchElementException] { + store.read(t2.getClass(), t2.key) Review comment: Fixed This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] baohe-zhang commented on a change in pull request #29509: [SPARK-31608][CORE][WEBUI][TEST] Add test suites for HybridStore and HistoryServerMemoryManager
baohe-zhang commented on a change in pull request #29509: URL: https://github.com/apache/spark/pull/29509#discussion_r475325096 ## File path: core/src/test/scala/org/apache/spark/deploy/history/HybridStoreSuite.scala ## @@ -0,0 +1,230 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.deploy.history + +import java.io.File +import java.util.NoSuchElementException +import java.util.concurrent.LinkedBlockingQueue + +import org.apache.commons.io.FileUtils +import org.scalatest.BeforeAndAfter + +import org.apache.spark.SparkFunSuite +import org.apache.spark.status.KVUtils._ +import org.apache.spark.util.kvstore._ + +class HybridStoreSuite extends SparkFunSuite with BeforeAndAfter { + + private var db: LevelDB = _ + private var dbpath: File = _ + + before { +dbpath = File.createTempFile("test.", ".ldb") +dbpath.delete() +db = new LevelDB(dbpath, new KVStoreScalaSerializer()) + } + + after { +if (db != null) { + db.close() +} +if (dbpath != null) { + FileUtils.deleteQuietly(dbpath) +} + } + + test("test multiple objects write read delete") { +val store = createHybridStore() + +val t1 = createCustomType1(1) +val t2 = createCustomType1(2) + +intercept[NoSuchElementException] { + store.read(t1.getClass(), t1.key) +} + +store.write(t1) +store.write(t2) +store.delete(t2.getClass(), t2.key) + +Seq(false, true).foreach { switch => + if (switch) switchHybridStore(store) + + intercept[NoSuchElementException] { + store.read(t2.getClass(), t2.key) + } + assert(store.read(t1.getClass(), t1.key) === t1) + assert(store.count(t1.getClass()) === 1L) +} + } + + test("test metadata") { +val store = createHybridStore() +assert(store.getMetadata(classOf[CustomType1]) === null) + +val t1 = createCustomType1(1) +store.setMetadata(t1) +assert(store.getMetadata(classOf[CustomType1]) === t1) + +// Switch to LevelDB and set a new metadata +switchHybridStore(store) + +val t2 = createCustomType1(2) +store.setMetadata(t2) +assert(store.getMetadata(classOf[CustomType1]) === t2) + } + + test("test update") { +val store = createHybridStore() +val t = createCustomType1(1) + +store.write(t) +t.name = "name2" +store.write(t) + +Seq(false, true).foreach { switch => + if (switch) switchHybridStore(store) + + assert(store.count(t.getClass()) === 1L) + 
assert(store.read(t.getClass(), t.key) === t) +} + } + + test("test basic iteration") { +val store = createHybridStore() + +val t1 = createCustomType1(1) +store.write(t1) +val t2 = createCustomType1(2) +store.write(t2) + +Seq(false, true).foreach { switch => + if (switch) switchHybridStore(store) + + assert(store.view(t1.getClass()).iterator().next().id === t1.id) + assert(store.view(t1.getClass()).skip(1).iterator().next().id === t2.id) + assert(store.view(t1.getClass()).skip(1).max(1).iterator().next().id === t2.id) + assert(store.view(t1.getClass()).first(t1.key).max(1).iterator().next().id === t1.id) + assert(store.view(t1.getClass()).first(t2.key).max(1).iterator().next().id === t2.id) +} + } + + test("test delete after switch") { +val store = createHybridStore() +val t = createCustomType1(1) +store.write(t) +switchHybridStore(store) +intercept[IllegalStateException] { + store.delete(t.getClass(), t.key) +} + } + + test("test klassMap") { +val store = createHybridStore() +val t1 = createCustomType1(1) +store.write(t1) +assert(store.klassMap.size === 1) +val t2 = new CustomType2("key2") +store.write(t2) +assert(store.klassMap.size === 2) + +switchHybridStore(store) +val t3 = new CustomType3("key3") +store.write(t3) +// Cannot put new klass to klassMap after the switching starts +assert(store.klassMap.size === 2) + } + + private def createHybridStore(): HybridStore = { +val store = new HybridStore() +store.setLevelDB(db) +store + } + + private def createCustomType1(i: Int): CustomType1 = { +new CustomType1("key" + i, "id" + i,
[GitHub] [spark] baohe-zhang commented on a change in pull request #29509: [SPARK-31608][CORE][WEBUI][TEST] Add test suites for HybridStore and HistoryServerMemoryManager
baohe-zhang commented on a change in pull request #29509: URL: https://github.com/apache/spark/pull/29509#discussion_r475325018 ## File path: core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala ## @@ -1509,13 +1513,18 @@ class FsHistoryProviderSuite extends SparkFunSuite with Matchers with Logging { new FileOutputStream(file).close() } - private def createTestConf(inMemory: Boolean = false): SparkConf = { + private def createTestConf( + inMemory: Boolean = false, + useHybridStore: Boolean = false): SparkConf = { val conf = new SparkConf() .set(HISTORY_LOG_DIR, testDir.getAbsolutePath()) .set(FAST_IN_PROGRESS_PARSING, true) if (!inMemory) { conf.set(LOCAL_STORE_DIR, Utils.createTempDir().getAbsolutePath()) + if (useHybridStore) { Review comment: Fixed This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on pull request #29501: [SPARK-32676][3.0][ML] Fix double caching in KMeans/BiKMeans
huaxingao commented on pull request #29501: URL: https://github.com/apache/spark/pull/29501#issuecomment-678885037

I think we need to put the fix in 3.0 because, when the data is already cached, this fix makes 3.0.0 behave the same as 2.4.

In 2.4:
```
cache norm in memory
```
Currently in 3.0:
```
always cache zipped data (data and norm) regardless of whether the original data is cached
```
After this fix:
```
if (data is cached)
  cache norm in memory and disk
else
  cache zipped data (data and norm)
```
The double caching in the current 3.0 may cause a performance degradation from 2.4 to 3.0, so we want to put the fix in 3.0.
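The three behaviours compared in the comment above can be modelled as a small decision table. This is only a sketch under stated assumptions: `KMeansCachingSketch`, `Persisted`, and the per-version function names are hypothetical and are not Spark's actual KMeans code, and the model ignores storage levels (memory versus memory and disk). It only shows why caching the zipped data on top of an already cached input holds the data twice.

```scala
// Illustrative model only: all names here are hypothetical, not Spark's KMeans internals.
object KMeansCachingSketch {
  // What the algorithm persists for the distance computation.
  sealed trait Persisted
  case object NormsOnly extends Persisted          // just the precomputed vector norms
  case object ZippedDataAndNorms extends Persisted // (vector, norm) pairs, i.e. another copy of the data

  // Behaviour per version, as summarised in the review comment.
  def spark24(inputCached: Boolean): Persisted = NormsOnly
  def spark300(inputCached: Boolean): Persisted = ZippedDataAndNorms
  def afterFix(inputCached: Boolean): Persisted =
    if (inputCached) NormsOnly else ZippedDataAndNorms

  // The data is held twice when the caller cached it and the zipped copy is cached as well.
  def isDoubleCached(inputCached: Boolean, persisted: Persisted): Boolean =
    inputCached && persisted == ZippedDataAndNorms

  def main(args: Array[String]): Unit = {
    for (cached <- Seq(true, false)) {
      println(
        s"inputCached=$cached " +
          s"3.0.0 doubleCached=${isDoubleCached(cached, spark300(cached))} " +
          s"fixed doubleCached=${isDoubleCached(cached, afterFix(cached))}")
    }
  }
}
```

In this model only the 3.0.0 behaviour with an already cached input is flagged as double cached, which is the case the fix targets.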
[GitHub] [spark] cloud-fan commented on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
cloud-fan commented on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678884094 good catch! LGTM
[GitHub] [spark] baohe-zhang commented on a change in pull request #29509: [SPARK-31608][CORE][WEBUI][TEST] Add test suites for HybridStore and HistoryServerMemoryManager
baohe-zhang commented on a change in pull request #29509: URL: https://github.com/apache/spark/pull/29509#discussion_r475320679 ## File path: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ## @@ -1214,8 +1214,8 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) // Use InMemoryStore to rebuild app store while (hybridStore == null) { // A RuntimeException will be thrown if the heap memory is not sufficient - memoryManager.lease(appId, attempt.info.attemptId, reader.totalSize, Review comment: It's related to the test code, but my original thought was that passing the actual amount of memory, instead of the file size, to memoryManager.lease() would make more sense, although exposing inner details to FsHistoryProvider is not ideal. What's your opinion? Should I revert this change?
[GitHub] [spark] LuciferYang commented on pull request #29000: [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode
LuciferYang commented on pull request #29000: URL: https://github.com/apache/spark/pull/29000#issuecomment-678872666 @Ngone51 Could you please review it again?
[GitHub] [spark] agrawaldevesh commented on a change in pull request #29452: [SPARK-32643][CORE][K8s] Consolidate state decommissioning in the TaskSchedulerImpl realm
agrawaldevesh commented on a change in pull request #29452:
URL: https://github.com/apache/spark/pull/29452#discussion_r475310046

## File path: core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionInfo.scala ##
@@ -18,11 +18,22 @@ package org.apache.spark.scheduler
 /**
- * Provides more detail when an executor is being decommissioned.
+ * Message providing more detail when an executor is being decommissioned.
  * @param message Human readable reason for why the decommissioning is happening.
  * @param isHostDecommissioned Whether the host (aka the `node` or `worker` in other places) is
  * being decommissioned too. Used to infer if the shuffle data might
  * be lost even if the external shuffle service is enabled.
  */
 private[spark] case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)
+
+/**
+ * State related to decommissioning that is kept by the TaskSchedulerImpl. This state is derived
+ * from the info message above but it is kept distinct to allow the state to evolve independently
+ * from the message.
+ */
+case class ExecutorDecommissionState(
+    message: String,

Review comment:
So far that need hasn't come up :-) But when it does, we can easily add it.

## File path: core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionInfo.scala ##
@@ -18,11 +18,22 @@ package org.apache.spark.scheduler
 /**
- * Provides more detail when an executor is being decommissioned.
+ * Message providing more detail when an executor is being decommissioned.
  * @param message Human readable reason for why the decommissioning is happening.
  * @param isHostDecommissioned Whether the host (aka the `node` or `worker` in other places) is
  * being decommissioned too. Used to infer if the shuffle data might
  * be lost even if the external shuffle service is enabled.
  */
 private[spark] case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)
+
+/**
+ * State related to decommissioning that is kept by the TaskSchedulerImpl. This state is derived
+ * from the info message above but it is kept distinct to allow the state to evolve independently
+ * from the message.
+ */
+case class ExecutorDecommissionState(
+    message: String,
+    // Timestamp the decommissioning commenced in millis since epoch of the driver's clock

Review comment:
Yeah, it is used to compute what was formerly known as `tidToExecutorKillTimeMapping` (search for this in the code on the left). It's not so much for expiry of the decommission state, for which we are using the cache that you suggested in the previous PR. Good suggestion to add some idea of how it is used. I will add a comment.

## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ##
@@ -1123,14 +1127,6 @@ private[spark] class TaskSetManager(
   def executorDecommission(execId: String): Unit = {
     recomputeLocality()
-    if (speculationEnabled) {

Review comment:
This was used as an efficiency improvement: to avoid doing this bookkeeping in the driver when speculation is not enabled, saving both CPU cycles and memory. Now this check is done in checkSpeculatableTasks, which is not even called if speculation is disabled, so we get this efficiency improvement automatically. This is a positive side effect of changing the bookkeeping by merging tidToExecutorKillTimeMapping into executorDecommissionState. In the meantime, I will hunt for a suitable test that adds some coverage here or consider adding one.

## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ##
@@ -926,18 +926,21 @@ private[spark] class TaskSchedulerImpl(
       // and some of those can have isHostDecommissioned false. We merge them such that
       // if we heard isHostDecommissioned ever true, then we keep that one since it is
       // most likely coming from the cluster manager and thus authoritative
-      val oldDecomInfo = executorsPendingDecommission.get(executorId)
-      if (!oldDecomInfo.exists(_.isHostDecommissioned)) {
-        executorsPendingDecommission(executorId) = decommissionInfo
+      val oldDecomState = executorsPendingDecommission.get(executorId)
+      if (!oldDecomState.exists(_.isHostDecommissioned)) {
+        executorsPendingDecommission(executorId) = ExecutorDecommissionState(
+          decommissionInfo.message,
+          oldDecomState.map(_.startTime).getOrElse(clock.getTimeMillis()),

Review comment:
Sure, I will tweak the comment above.
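The merge rule quoted in the TaskSchedulerImpl hunk of this review can be sketched in isolation as follows. This is a minimal, self-contained sketch and not the code from the pull request: the object name `DecommissionStateSketch`, the `merge` helper and the explicit `nowMillis` parameter (standing in for `clock.getTimeMillis()`) are hypothetical, and the third field of `ExecutorDecommissionState` is assumed to be `isHostDecommissioned`, because the quoted diff is cut off before that argument.

```scala
object DecommissionStateSketch {
  // Mirrors the message class quoted in the diff.
  final case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)

  // Scheduler-side state. The third field is an assumption (the diff is truncated).
  final case class ExecutorDecommissionState(
      message: String,
      startTime: Long, // millis since epoch on the driver's clock
      isHostDecommissioned: Boolean)

  // Hypothetical helper: once a host-level decommission has been recorded it is
  // treated as authoritative and kept; otherwise the newer info replaces the old
  // state, but the original start time is preserved.
  def merge(
      old: Option[ExecutorDecommissionState],
      info: ExecutorDecommissionInfo,
      nowMillis: Long): ExecutorDecommissionState =
    old match {
      case Some(state) if state.isHostDecommissioned => state
      case _ =>
        ExecutorDecommissionState(
          info.message,
          old.map(_.startTime).getOrElse(nowMillis),
          info.isHostDecommissioned)
    }
}
```

Keeping `startTime` from the first message is what would let a later computation (such as the former `tidToExecutorKillTimeMapping`) measure elapsed time from when decommissioning actually began, rather than from the most recent update.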
[GitHub] [spark] zhengruifeng commented on pull request #29501: [SPARK-32676][3.0][ML] Fix double caching in KMeans/BiKMeans
zhengruifeng commented on pull request #29501: URL: https://github.com/apache/spark/pull/29501#issuecomment-678871882 This double caching did not exist in 2.4; it was first introduced in 3.0.0, so I am inclined to include it in RC2. What do you think, @huaxingao?
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
AngersZhuuuu commented on a change in pull request #29526:
URL: https://github.com/apache/spark/pull/29526#discussion_r475310377

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ##
@@ -176,9 +176,10 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
       l.resolve(fsRelation.dataSchema, fsRelation.sparkSession.sessionState.analyzer.resolver)
       // Partition keys are not available in the statistics of the files.
+      val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionColumns.contains)

Review comment:
> Sure. Added.
>
> It only happens in hive-1.2 profile, because for hive-2.3 we go for a different path to create pushed down filters. In that path, we have checked if an attribute is in the field map.

LGTM. When I was working on https://github.com/apache/spark/pull/29406, I thought that, judging from the code, dataColumns would not contain partition columns. Thanks for this fix.
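The one-line change quoted in this review, dropping partition columns from the data columns, can be illustrated outside Spark with a small sketch. The `Column` case class and the `withoutPartitionColumns` helper are hypothetical stand-ins; the real code operates on Spark's own attribute types, which are not reproduced here.

```scala
object PartitionColumnFilterSketch {
  // Hypothetical stand-in for a resolved attribute.
  final case class Column(name: String)

  // Keep only the data columns that are not also partition columns,
  // preserving the original order of the data columns.
  def withoutPartitionColumns(
      dataColumns: Seq[Column],
      partitionColumns: Seq[Column]): Seq[Column] = {
    val partitionSet = partitionColumns.toSet
    dataColumns.filterNot(partitionSet.contains)
  }
}
```

For example, with data columns `id`, `dt`, `value` and partition column `dt`, the helper returns `id` and `value`.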
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29507: [SPARK-32680][SQL] Don't Preprocess V2 CTAS with Unresolved Query
AmplabJenkins removed a comment on pull request #29507: URL: https://github.com/apache/spark/pull/29507#issuecomment-678870683
[GitHub] [spark] AmplabJenkins commented on pull request #29507: [SPARK-32680][SQL] Don't Preprocess V2 CTAS with Unresolved Query
AmplabJenkins commented on pull request #29507: URL: https://github.com/apache/spark/pull/29507#issuecomment-678870683
[GitHub] [spark] SparkQA commented on pull request #29516: [WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
SparkQA commented on pull request #29516: URL: https://github.com/apache/spark/pull/29516#issuecomment-678870379 **[Test build #127824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127824/testReport)** for PR 29516 at commit [`f3d14c6`](https://github.com/apache/spark/commit/f3d14c61550877a6d3b2df15954fee30c8546fa5).
[GitHub] [spark] SparkQA commented on pull request #29507: [SPARK-32680][SQL] Don't Preprocess V2 CTAS with Unresolved Query
SparkQA commented on pull request #29507: URL: https://github.com/apache/spark/pull/29507#issuecomment-678870388 **[Test build #127825 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127825/testReport)** for PR 29507 at commit [`e03e64d`](https://github.com/apache/spark/commit/e03e64dbc44660fbcd2183e2cdc222ebccbcd7c8).
[GitHub] [spark] srowen commented on pull request #29501: [SPARK-32676][3.0][ML] Fix double caching in KMeans/BiKMeans
srowen commented on pull request #29501: URL: https://github.com/apache/spark/pull/29501#issuecomment-678869607 Do we need it in 3.0? I'm not super against it, but it's more of an improvement or optimization than a bug fix.
[GitHub] [spark] srowen commented on pull request #29516: [WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
srowen commented on pull request #29516: URL: https://github.com/apache/spark/pull/29516#issuecomment-678869497 BTW I think we may still have a real test failure here, I'm looking into it.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29516: [WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
AmplabJenkins removed a comment on pull request #29516: URL: https://github.com/apache/spark/pull/29516#issuecomment-678869069
[GitHub] [spark] AmplabJenkins commented on pull request #29516: [WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
AmplabJenkins commented on pull request #29516: URL: https://github.com/apache/spark/pull/29516#issuecomment-678869069
[GitHub] [spark] zhengruifeng commented on pull request #29516: [WIP][SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
zhengruifeng commented on pull request #29516: URL: https://github.com/apache/spark/pull/29516#issuecomment-678868865 retest this please
[GitHub] [spark] Ngone51 commented on a change in pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression
Ngone51 commented on a change in pull request #29270:
URL: https://github.com/apache/spark/pull/29270#discussion_r475307366

## File path: sql/core/src/test/resources/tpcds-plan-stability/approved-plans-modified/q10.sf100/explain.txt ##
@@ -0,0 +1,286 @@
+== Physical Plan ==
+TakeOrderedAndProject (52)
++- * HashAggregate (51)
+   +- Exchange (50)
+      +- * HashAggregate (49)
+         +- * Project (48)
+            +- * BroadcastHashJoin Inner BuildLeft (47)
+               :- BroadcastExchange (43)
+               :  +- * Project (42)
+               :     +- * BroadcastHashJoin Inner BuildRight (41)
+               :        :- * Project (35)
+               :        :  +- SortMergeJoin LeftSemi (34)
+               :        :     :- SortMergeJoin LeftSemi (25)
+               :        :     :  :- * Sort (5)
+               :        :     :  :  +- Exchange (4)
+               :        :     :  :     +- * Filter (3)
+               :        :     :  :        +- * ColumnarToRow (2)
+               :        :     :  :           +- Scan parquet default.customer (1)
+               :        :     :  +- * Sort (24)
+               :        :     :     +- Exchange (23)
+               :        :     :        +- Union (22)
+               :        :     :           :- * Project (15)
+               :        :     :           :  +- * BroadcastHashJoin Inner BuildRight (14)
+               :        :     :           :     :- * Filter (8)
+               :        :     :           :     :  +- * ColumnarToRow (7)
+               :        :     :           :     :     +- Scan parquet default.web_sales (6)
+               :        :     :           :     +- BroadcastExchange (13)
+               :        :     :           :        +- * Project (12)
+               :        :     :           :           +- * Filter (11)
+               :        :     :           :              +- * ColumnarToRow (10)
+               :        :     :           :                 +- Scan parquet default.date_dim (9)
+               :        :     :           +- * Project (21)
+               :        :     :              +- * BroadcastHashJoin Inner BuildRight (20)
+               :        :     :                 :- * Filter (18)
+               :        :     :                 :  +- * ColumnarToRow (17)
+               :        :     :                 :     +- Scan parquet default.catalog_sales (16)
+               :        :     :                 +- ReusedExchange (19)
+               :        :     +- * Sort (33)
+               :        :        +- Exchange (32)
+               :        :           +- * Project (31)
+               :        :              +- * BroadcastHashJoin Inner BuildRight (30)
+               :        :                 :- * Filter (28)
+               :        :                 :  +- * ColumnarToRow (27)
+               :        :                 :     +- Scan parquet default.store_sales (26)
+               :        :                 +- ReusedExchange (29)
+               :        +- BroadcastExchange (40)
+               :           +- * Project (39)
+               :              +- * Filter (38)
+               :                 +- * ColumnarToRow (37)
+               :                    +- Scan parquet default.customer_address (36)
+               +- * Filter (46)
+                  +- * ColumnarToRow (45)
+                     +- Scan parquet default.customer_demographics (44)
+
+
+(1) Scan parquet default.customer
+Output [3]: [c_customer_sk#1, c_current_cdemo_sk#2, c_current_addr_sk#3]
+Batched: true
+Location: InMemoryFileIndex [file:/Users/yi.wu/IdeaProjects/spark/sql/core/spark-warehouse/org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite/customer]

Review comment:
Oh, I see. Thanks for pointing it out. I'll make a follow-up soon.
[GitHub] [spark] LuciferYang commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
LuciferYang commented on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-678867553 @srowen @cloud-fan @HyukjinKwon @dongjoon-hyun Thank you for your review~
[GitHub] [spark] zhengruifeng commented on pull request #29501: [SPARK-32676][3.0][ML] Fix double caching in KMeans/BiKMeans
zhengruifeng commented on pull request #29501: URL: https://github.com/apache/spark/pull/29501#issuecomment-678867157 @srowen @huaxingao Thanks for reviewing! Would you mind helping to backport this to 3.0? I do not have a computer to do this right now.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29477: [SPARK-32661][K8S] Spark executors should request extra memory for off-heap allocations.
AmplabJenkins removed a comment on pull request #29477: URL: https://github.com/apache/spark/pull/29477#issuecomment-678860803 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/32446/ Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #29477: [SPARK-32661][K8S] Spark executors should request extra memory for off-heap allocations.
SparkQA commented on pull request #29477: URL: https://github.com/apache/spark/pull/29477#issuecomment-678860789 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32446/
[GitHub] [spark] AmplabJenkins commented on pull request #29477: [SPARK-32661][K8S] Spark executors should request extra memory for off-heap allocations.
AmplabJenkins commented on pull request #29477: URL: https://github.com/apache/spark/pull/29477#issuecomment-678860798
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29477: [SPARK-32661][K8S] Spark executors should request extra memory for off-heap allocations.
AmplabJenkins removed a comment on pull request #29477: URL: https://github.com/apache/spark/pull/29477#issuecomment-678860798 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
AmplabJenkins removed a comment on pull request #29421: URL: https://github.com/apache/spark/pull/29421#issuecomment-678860152
[GitHub] [spark] AmplabJenkins commented on pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
AmplabJenkins commented on pull request #29421: URL: https://github.com/apache/spark/pull/29421#issuecomment-678860152
[GitHub] [spark] SparkQA commented on pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive
SparkQA commented on pull request #29421: URL: https://github.com/apache/spark/pull/29421#issuecomment-678859868 **[Test build #127823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127823/testReport)** for PR 29421 at commit [`5f03222`](https://github.com/apache/spark/commit/5f032229ca2c457753622e21e22d92848de24fa6).
[GitHub] [spark] SparkQA commented on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
SparkQA commented on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678845961 **[Test build #127820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127820/testReport)** for PR 29526 at commit [`b37f694`](https://github.com/apache/spark/commit/b37f6949f1f7c4c6d2264559402a963eb077990d).
[GitHub] [spark] dongjoon-hyun commented on pull request #29505: [SPARK-32648][SS] Remove unused DELETE_ACTION in FileStreamSinkLog
dongjoon-hyun commented on pull request #29505: URL: https://github.com/apache/spark/pull/29505#issuecomment-678846054 Thank you and welcome, @michal-wieleba. You have been added to the Apache Spark contributor group and SPARK-32648 is assigned to you.
[GitHub] [spark] viirya commented on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
viirya commented on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678845494 Yeah, we usually don't run the hive-1.2 tests unless we know the diff touches the hive 1.2 code path. In the case of these failed tests, the changes don't touch that code directly but affect it indirectly...
[GitHub] [spark] dongjoon-hyun closed pull request #29505: [SPARK-32648][SS] Remove unused DELETE_ACTION in FileStreamSinkLog
dongjoon-hyun closed pull request #29505: URL: https://github.com/apache/spark/pull/29505
[GitHub] [spark] maropu edited a comment on pull request #29526: [SPARK-32352][SQL][FOLLOW-UP][test-hadoop2.7][test-hive1.2] Exclude partition columns from data columns
maropu edited a comment on pull request #29526: URL: https://github.com/apache/spark/pull/29526#issuecomment-678845205 Nice, thanks for the swift fixes, @viirya! Anyway, it seems we didn't notice this test failure for 10+ days, so we need to carefully check the branches w/hive-1.2...