[GitHub] spark pull request #16154: [SPARK-17822] [R] Make JVMObjectTracker a member ...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16154#discussion_r91018557 --- Diff: core/src/main/scala/org/apache/spark/api/r/JVMObjectTracker.scala --- @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.api.r + +import java.util.concurrent.atomic.AtomicInteger +import java.util.concurrent.ConcurrentHashMap + +/** JVM object ID wrapper */ +private[r] case class JVMObjectId(id: String) --- End diff -- Is there a reason we should use Strings instead of Long - Just wondering as this will also add to memory overhead --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16154: [SPARK-17822] [R] Make JVMObjectTracker a member ...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16154#discussion_r91145714 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala --- @@ -247,7 +247,7 @@ private[sql] object SQLUtils extends Logging { dataType match { case 's' => // Read StructType for DataFrame -val fields = SerDe.readList(dis).asInstanceOf[Array[Object]] +val fields = SerDe.readList(dis, jvmObjectTracker = null).asInstanceOf[Array[Object]] --- End diff -- The two changes below are functions used in serializing and deserializing StructType from R to JVM (when calling collect or parallelize) Again I think since StructTypes themselves can't contain other JVM references I think its fine, but I'm less sure about this. (i.e what happens when we say embed an array into the structType). Another option here is to figure out which RBackend is calling this function and use its JVMObjectTracker ? Do we want to support cases where there are multiple active RBackends ? If not we can denote an `activeRBackend` and then use its jvm object tracker in these places ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16154: [SPARK-17822] [R] Make JVMObjectTracker a member ...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16154#discussion_r91136204 --- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala --- @@ -143,12 +142,8 @@ private[r] class RBackendHandler(server: RBackend) val cls = if (isStatic) { Utils.classForName(objId) } else { -JVMObjectTracker.get(objId) match { - case None => throw new IllegalArgumentException("Object not found " + objId) - case Some(o) => -obj = o -o.getClass -} +obj = server.jvmObjectTracker.get(JVMObjectId(objId)) --- End diff -- This is minor but the exception now thrown will be a `NullPointerException` in place of what it was before ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16154: [SPARK-17822] [R] Make JVMObjectTracker a member ...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16154#discussion_r91140911 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala --- @@ -158,7 +158,7 @@ private[sql] object SQLUtils extends Logging { val dis = new DataInputStream(bis) val num = SerDe.readInt(dis) Row.fromSeq((0 until num).map { i => - doConversion(SerDe.readObject(dis), schema.fields(i).dataType) + doConversion(SerDe.readObject(dis, jvmObjectTracker = null), schema.fields(i).dataType) --- End diff -- I think this one is safe because this is called on the output from the R UDF and the R UDF is assumed to return native types (i.e. no references to other JVM objects) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15923: [SPARK-4105] retry the fetch or stage if shuffle block i...
Github user davies commented on the issue: https://github.com/apache/spark/pull/15923 ping @JoshRosen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16174: [SPARK-18741][STREAMING] Reuse or clean-up SparkC...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/16174#discussion_r91141519 --- Diff: streaming/src/test/java/org/apache/spark/streaming/LocalJavaStreamingContext.java --- @@ -28,6 +29,7 @@ @Before public void setUp() { +SparkContext$.MODULE$.stopActiveContext(); --- End diff -- Can't you just call `SparkContext.stopActiveContext()` from Java? I see static methods in the bytecode for things like `jarOfObject`. The line is also indented incorrectly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16161 **[Test build #69741 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69741/consoleFull)** for PR 16161 at commit [`f325e75`](https://github.com/apache/spark/commit/f325e75ab61ac903a605090b756461f4b50b64ff). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16174: [SPARK-18741][STREAMING] Reuse or clean-up SparkC...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/16174#discussion_r91144141 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -2350,6 +2350,16 @@ object SparkContext extends Logging { } } + private[spark] def getActiveContext(): Option[SparkContext] = { +SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { + Option(activeContext.get()) +} + } + + private[spark] def stopActiveContext(): Unit = { --- End diff -- I don't know. I'm not a big fan of the approach you're taking here: calling this method before running tests. That feels like a sledgehammer to fix flaky tests. I think it would be better for test code to be more careful about cleaning after itself. Kinda like most tests in spark-core use `LocalSparkContext` to more or less automatically do that without the need for these methods. The `ReuseableSparkContext` trait you have is a step in that direction. If you make sure all needed streaming tests are using it, and keep this state within that class, I think it would be a better change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16068: [SPARK-18637][SQL]Stateful UDF should be consider...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/16068#discussion_r91142141 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala --- @@ -487,6 +488,52 @@ class HiveUDFSuite extends QueryTest with TestHiveSingleton with SQLTestUtils { assert(count4 == 1) sql("DROP TABLE parquet_tmp") } + + test("Hive Stateful UDF") { +withUserDefinedFunction("statefulUDF" -> true, "statelessUDF" -> true) { + sql(s"CREATE TEMPORARY FUNCTION statefulUDF AS '${classOf[StatefulUDF].getName}'") + sql(s"CREATE TEMPORARY FUNCTION statelessUDF AS '${classOf[StatelessUDF].getName}'") + withTempView("inputTable") { --- End diff -- The test cannot resolve the function and throw the error if I use: test("Hive Stateful UDF") { withUserDefinedFunction("statefulUDF" -> true, "statelessUDF" -> true) { sql(s"CREATE TEMPORARY FUNCTION statefulUDF AS '${classOf[StatefulUDF].getName}'") sql(s"CREATE TEMPORARY FUNCTION statelessUDF AS '${classOf[StatelessUDF].getName}'") val testData = spark.range(10).repartition(1) println(s"start session: $spark") val m = testData.select("statefulUDF() as s, ") checkAnswer(testData.select("statefulUDF() as s").agg(max($"s")), Row(10)) ... Do I miss anything, or is it a bug? I will investigate why this happens. Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`statefulUDF() as s`' given input columns: [id];; 'Project ['statefulUDF() as s] +- Repartition 1, true +- Range (0, 10, step=1, splits=Some(1)) org.apache.spark.sql.AnalysisException: cannot resolve '`statefulUDF() as s`' given input columns: [id];; 'Project ['statefulUDF() as s] +- Repartition 1, true +- Range (0, 10, step=1, splits=Some(1)) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:314) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:314) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:313) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:192) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature
[GitHub] spark issue #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes >=0 in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16175 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes >=0 in...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/16175 @gatorsmile Could you please take a look when you have time? Thanks a lot!! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16174: [SPARK-18741][STREAMING] Reuse or clean-up SparkContext ...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/16174 > This is resource intensive and it can lead to unneeded test failures (flakyness) when park.driver.allowMultipleContexts is disabled (this happens when the order of tests changes). Do you mean the recent failures? It was fixed by #16105. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/16175 [SPARK-17460][SQL]check if statistics.sizeInBytes >=0 in canBroadcast ## What changes were proposed in this pull request? 1. In SparkStrategies.canBroadcast, I will add the check plan.statistics.sizeInBytes >= 0 2. In LocalRelations.statistics, when calculate the statistics, I will change the size to BigInt so it won't overflow. ## How was this patch tested? I will add a test case to make sure the statistics.sizeInBytes won't overflow. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-17460 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16175.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16175 commit 9ad6725975e36e513ce89f54582409bd3e9d2cfe Author: Huaxin GaoDate: 2016-12-06T18:22:38Z [SPARK-17460][SQL]check if statistics.sizeInBytes >=0 in canBroadcast --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16128: [SPARK-18671][SS][TEST] Added tests to ensure stability ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16128 **[Test build #69740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69740/consoleFull)** for PR 16128 at commit [`26a86d6`](https://github.com/apache/spark/commit/26a86d64f2f492094960b19332cabd7457f95e61). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16128: [SPARK-18671][SS][TEST] Added tests to ensure stability ...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/16128 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16172: [SPARK-18740] Log spark.app.name in driver logs
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16172 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16172: [SPARK-18740] Log spark.app.name in driver logs
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/16172 LGTM. Merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16014: [SPARK-18590][SPARKR] build R source package when...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16014#discussion_r91135457 --- Diff: dev/create-release/release-build.sh --- @@ -221,14 +235,13 @@ if [[ "$1" == "package" ]]; then # We increment the Zinc port each time to avoid OOM's and other craziness if multiple builds # share the same Zinc server. - # Make R source package only once. (--r) FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos" make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" & make_binary_release "hadoop2.4" "-Phadoop-2.4 $FLAGS" "3034" & make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" & make_binary_release "hadoop2.7" "-Phadoop-2.7 $FLAGS" "3036" "withpip" & make_binary_release "hadoop2.4-without-hive" "-Psparkr -Phadoop-2.4 -Pyarn -Pmesos" "3037" & - make_binary_release "without-hadoop" "--r -Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" & + make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" "withr" & --- End diff -- Any specific reason to use the `without-hadoop` build for the R package ? Just wondering if this will affect the users in any fashion --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16168: [SPARK-18209][SQL] More robust view canonicalization wit...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16168 also cc @cloud-fan --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16168: [SPARK-18209][SQL] More robust view canonicalization wit...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16168 @jiangxb1987 We need a low level design about the changes you made here, especially document the issue you hit and the solution you choose. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16138: [WIP][SPARK-16609] Add to_date/to_timestamp with format ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16138 **[Test build #69739 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69739/consoleFull)** for PR 16138 at commit [`760650c`](https://github.com/apache/spark/commit/760650c312cb3e514f85404bf307ec4b41f00bb9). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16138: [WIP][SPARK-16609] Add to_date/to_timestamp with ...
Github user anabranch commented on a diff in the pull request: https://github.com/apache/spark/pull/16138#discussion_r91134199 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1047,6 +1047,53 @@ case class ToDate(child: Expression) extends UnaryExpression with ImplicitCastIn } /** + * Parses a column with a format to a timestamp. + */ +@ExpressionDescription( + usage = "_FUNC_(timestamp_str, fmt) - Parses the `timestamp` expression " + +"with the `fmt` expression.", + extended = """ +Examples: + > SELECT _FUNC_('2016-12-31', '-MM-dd'); + 2016-12-31 + """) +case class ParseToDate(left: Expression, right: Expression, child: Expression) + extends RuntimeReplaceable with ImplicitCastInputTypes { + + def this(left: Expression, format: Expression) = { +this(left, format, ToDate(new ParseToTimestamp(left, format))) + } + + override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType) + override def dataType: DataType = DateType + override def flatArguments: Iterator[Any] = Iterator(left, right) + override def sql: String = s"$prettyName(${left.sql}, ${right.sql})" +} + +/** + * Parses a column with a format to a timestamp. + */ +@ExpressionDescription( + usage = "_FUNC_(timestamp, fmt) - Parses the `timestamp` expression with the `fmt` expression.", + extended = """ +Examples: + > SELECT _FUNC_('2016-12-31', '-MM-dd'); + 2016-12-31 00:00:00.0 + """) +case class ParseToTimestamp(left: Expression, right: Expression, child: Expression) + extends RuntimeReplaceable with ImplicitCastInputTypes { + + def this(left: Expression, format: Expression) = { + this(left, format, Cast(UnixTimestamp(left, format), TimestampType)) --- End diff -- This is returning the date in millis (1970) as opposed to seconds. But if you use the `UnixTimestamp` expression via the function, then cast, it will work just fine. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16138: [WIP][SPARK-16609] Add to_date/to_timestamp with ...
Github user anabranch commented on a diff in the pull request: https://github.com/apache/spark/pull/16138#discussion_r91134077 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1047,6 +1047,53 @@ case class ToDate(child: Expression) extends UnaryExpression with ImplicitCastIn } /** + * Parses a column with a format to a timestamp. + */ +@ExpressionDescription( + usage = "_FUNC_(timestamp_str, fmt) - Parses the `timestamp` expression " + +"with the `fmt` expression.", + extended = """ +Examples: + > SELECT _FUNC_('2016-12-31', '-MM-dd'); + 2016-12-31 + """) +case class ParseToDate(left: Expression, right: Expression, child: Expression) + extends RuntimeReplaceable with ImplicitCastInputTypes { + + def this(left: Expression, format: Expression) = { +this(left, format, ToDate(new ParseToTimestamp(left, format))) + } + + override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType) + override def dataType: DataType = DateType + override def flatArguments: Iterator[Any] = Iterator(left, right) + override def sql: String = s"$prettyName(${left.sql}, ${right.sql})" +} + +/** + * Parses a column with a format to a timestamp. + */ +@ExpressionDescription( + usage = "_FUNC_(timestamp, fmt) - Parses the `timestamp` expression with the `fmt` expression.", + extended = """ +Examples: + > SELECT _FUNC_('2016-12-31', '-MM-dd'); + 2016-12-31 00:00:00.0 + """) +case class ParseToTimestamp(left: Expression, right: Expression, child: Expression) + extends RuntimeReplaceable with ImplicitCastInputTypes { + + def this(left: Expression, format: Expression) = { + this(left, format, Cast(UnixTimestamp(left, format), TimestampType)) --- End diff -- For reasons unbeknownst to me, Cast(, TimestampType) is not equivalent to cast("timestamp") --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91133855 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -557,9 +557,12 @@ class SessionCatalog( * If the relation is a view, the relation will be wrapped in a [[SubqueryAlias]] which will * track the name of the view. */ - def lookupRelation(name: TableIdentifier, alias: Option[String] = None): LogicalPlan = { + def lookupRelation( + name: TableIdentifier, + alias: Option[String] = None, + databaseHint: Option[String] = None): LogicalPlan = { --- End diff -- Need a document for these parameters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91133644 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -509,32 +509,42 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { +private def lookupTableFromCatalog( +u: UnresolvedRelation, db: Option[String] = None): LogicalPlan = { try { -catalog.lookupRelation(u.tableIdentifier, u.alias) +catalog.lookupRelation(u.tableIdentifier, u.alias, db) } catch { case _: NoSuchTableException => u.failAnalysis(s"Table or view not found: ${u.tableName}") } } -def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { - case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => -i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) - case u: UnresolvedRelation => -val table = u.tableIdentifier -if (table.database.isDefined && conf.runSQLonFile && !catalog.isTemporaryTable(table) && +def apply(plan: LogicalPlan): LogicalPlan = { + var currentDatabase = catalog.getCurrentDatabase + plan resolveOperators { +case i@InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => + i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) +case u: UnresolvedRelation => + val table = u.tableIdentifier + if (table.database.isDefined && conf.runSQLonFile && !catalog.isTemporaryTable(table) && (!catalog.databaseExists(table.database.get) || !catalog.tableExists(table))) { - // If the database part is specified, and we support running SQL directly on files, and - // it's not a temporary view, and the table does not exist, then let's just return the - // original UnresolvedRelation. It is possible we are matching a query like "select * - // from parquet.`/path/to/query`". The plan will get resolved later. - // Note that we are testing (!db_exists || !table_exists) because the catalog throws - // an exception from tableExists if the database does not exist. - u -} else { - lookupTableFromCatalog(u) -} +// If the database part is specified, and we support running SQL directly on files, and +// it's not a temporary view, and the table does not exist, then let's just return the +// original UnresolvedRelation. It is possible we are matching a query like "select * +// from parquet.`/path/to/query`". The plan will get resolved later. +// Note that we are testing (!db_exists || !table_exists) because the catalog throws +// an exception from tableExists if the database does not exist. +u + } else { +val logicalPlan = lookupTableFromCatalog(u, Some(currentDatabase)) +currentDatabase = logicalPlan.collectFirst { --- End diff -- Please avoid using `var` here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91133302 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -509,32 +509,42 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { +private def lookupTableFromCatalog( +u: UnresolvedRelation, db: Option[String] = None): LogicalPlan = { try { -catalog.lookupRelation(u.tableIdentifier, u.alias) +catalog.lookupRelation(u.tableIdentifier, u.alias, db) } catch { case _: NoSuchTableException => u.failAnalysis(s"Table or view not found: ${u.tableName}") } } -def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { - case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => -i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) - case u: UnresolvedRelation => -val table = u.tableIdentifier -if (table.database.isDefined && conf.runSQLonFile && !catalog.isTemporaryTable(table) && +def apply(plan: LogicalPlan): LogicalPlan = { + var currentDatabase = catalog.getCurrentDatabase --- End diff -- Why we need to do it here? This should be moved to https://github.com/jiangxb1987/spark/blob/434009ef7fcfc552314dca35dc9395bb98f5e9c6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L539 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16138: [WIP][SPARK-16609] Add to_date/to_timestamp with format ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16138 **[Test build #69738 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69738/consoleFull)** for PR 16138 at commit [`f1e075c`](https://github.com/apache/spark/commit/f1e075c49ba66f81e9cc40b9d96030f6afe6d9a2). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91131928 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -221,11 +223,14 @@ case class CatalogTable( s"Last Access: ${new Date(lastAccessTime).toString}", s"Type: ${tableType.name}", if (schema.nonEmpty) s"Schema: ${schema.mkString("[", ", ", "]")}" else "", +if (originalSchema.isDefined) s"Original Schema: ${originalSchema.mkString("[", ", ", "]")}" +else "", --- End diff -- If one line is not enough, you can create a variable and then use the variable here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91131668 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -557,9 +557,12 @@ class SessionCatalog( * If the relation is a view, the relation will be wrapped in a [[SubqueryAlias]] which will * track the name of the view. */ - def lookupRelation(name: TableIdentifier, alias: Option[String] = None): LogicalPlan = { + def lookupRelation( --- End diff -- General suggestion: when you made a change like this, please update the above function descriptions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15819 Could you post the layout of that staging folder? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16138: [WIP][SPARK-16609] Add to_date/to_timestamp with format ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16138 **[Test build #69737 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69737/consoleFull)** for PR 16138 at commit [`8e8339c`](https://github.com/apache/spark/commit/8e8339cf0dc522616d7eba8f4b7ca3b35dd4229b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16000: [SPARK-18537][Web UI]Add a REST api to spark stre...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/16000#discussion_r91123740 --- Diff: streaming/src/main/java/org/apache/spark/streaming/status/api/v1/BatchStatus.java --- @@ -0,0 +1,30 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.streaming.status.api.v1; --- End diff -- Pretty much. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16174: [SPARK-18741][STREAMING] Reuse or clean-up SparkContext ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16174 **[Test build #69736 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69736/consoleFull)** for PR 16174 at commit [`04ce488`](https://github.com/apache/spark/commit/04ce488e9d6927f3cb172e854928c78ec094fb8d). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16172: [SPARK-18740] Log spark.app.name in driver logs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16172 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16172: [SPARK-18740] Log spark.app.name in driver logs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16172 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69731/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16172: [SPARK-18740] Log spark.app.name in driver logs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16172 **[Test build #69731 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69731/consoleFull)** for PR 16172 at commit [`5401e64`](https://github.com/apache/spark/commit/5401e64552325353104bc7addf180a5a37f35047). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16092: [SPARK-18662] Move resource managers to separate directo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16092 **[Test build #69735 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69735/consoleFull)** for PR 16092 at commit [`3e78591`](https://github.com/apache/spark/commit/3e7859134b9deeec2a160c9e60d0ad69eee719f1). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16174: [SPARK-18741][STREAMING] Reuse or clean-up SparkC...
GitHub user hvanhovell opened a pull request: https://github.com/apache/spark/pull/16174 [SPARK-18741][STREAMING] Reuse or clean-up SparkContext in streaming tests ## What changes were proposed in this pull request? Tests in Spark Streaming currently create a `SparkContext` for each test, and sometimes do not clean-up afterwards. This is resource intensive and it can lead to unneeded test failures (flakyness) when park.driver.allowMultipleContexts is disabled (this happens when the order of tests changes). This PR makes most test re-use a `SparkContext`. For tests that have to create a new context (for instance `CheckpointSuite`) we make sure that no active `SparkContext` exists before the test, and that the created `SparkContext` is cleaned up afterwards. I have refactored the `TestSuiteBase` into two classes `TestSuiteBase` and a parent class `ReusableSparkContext`; this to make `SparkContext` management relatively straightforward for most tests. I have done a simple very unscientific benchmark (n=1), and streaming tests with this patch took 212 seconds and streaming tests without this patch took 252 seconds. ## How was this patch tested? The patch only covers test code. You can merge this pull request into a Git repository by running: $ git pull https://github.com/hvanhovell/spark SPARK-18741 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16174.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16174 commit 04ce488e9d6927f3cb172e854928c78ec094fb8d Author: Herman van HovellDate: 2016-12-06T16:29:23Z Reuse or clean-up SparkContext in streaming tests --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16174: [SPARK-18741][STREAMING] Reuse or clean-up SparkContext ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/16174 cc @tdas @zsxwing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16092: [SPARK-18662] Move resource managers to separate directo...
Github user foxish commented on the issue: https://github.com/apache/spark/pull/16092 @vanzin Done. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16006: [SPARK-18580] [DStreams] [external/kafka-0-10] Us...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/16006#discussion_r91118893 --- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala --- @@ -143,9 +147,14 @@ private[spark] class DirectKafkaInputDStream[K, V]( lagPerPartition.map { case (tp, lag) => val maxRateLimitPerPartition = ppc.maxRatePerPartition(tp) - val backpressureRate = Math.round(lag / totalLag.toFloat * rate) - tp -> (if (maxRateLimitPerPartition > 0) { -Math.min(backpressureRate, maxRateLimitPerPartition)} else backpressureRate) + val effectiveRate = if (rate >= 0) rate else backpressureInitialRate + val estimateRate = Math.round(lag / totalLag.toFloat * effectiveRate) + val backpressureRate = +if (estimateRate > maxRateLimitPerPartition && maxRateLimitPerPartition > 0) { + maxRateLimitPerPartition +} +else estimateRate --- End diff -- http://spark.apache.org/contributing.html use braces around the else clause unless it's all on one line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16006: [SPARK-18580] [DStreams] [external/kafka-0-10] Us...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/16006#discussion_r91118458 --- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala --- @@ -67,6 +67,10 @@ private[spark] class DirectKafkaInputDStream[K, V]( ekp } + val backpressureInitialRate: Long = + _ssc.sparkContext.conf.getLong("spark.streaming.backpressure.initialRate", + _ssc.sparkContext.conf.getDouble("spark.streaming.backpressure.pid.minRate", 100).round) --- End diff -- Shouldn't these be using ssc instead of _ssc for consistency? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16173: [SPARK-18742][CORE]readd spark.broadcast.factory conf to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16173 **[Test build #69734 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69734/consoleFull)** for PR 16173 at commit [`f23becc`](https://github.com/apache/spark/commit/f23beccf1d484ea5121f6cf605b92090400808ad). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16173: [SPARK-18742][CORE]readd spark.broadcast.factory conf to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16173 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16173: [SPARK-18742][CORE]readd spark.broadcast.factory conf to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16173 **[Test build #69733 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69733/consoleFull)** for PR 16173 at commit [`1b4d4b6`](https://github.com/apache/spark/commit/1b4d4b632523f33b38e6997dddbba954f324ea59). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16173: [SPARK-18742][CORE]readd spark.broadcast.factory conf to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16173 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69733/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16173: [SPARK-18742][CORE]readd spark.broadcast.factory conf to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16173 **[Test build #69733 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69733/consoleFull)** for PR 16173 at commit [`1b4d4b6`](https://github.com/apache/spark/commit/1b4d4b632523f33b38e6997dddbba954f324ea59). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user nsyca commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r9663 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala --- @@ -448,19 +476,105 @@ class SQLViewSuite extends QueryTest with SQLTestUtils with TestHiveSingleton { } } + test("Using view after change the origin view") { +withView("v1", "v2") { + sql("CREATE VIEW v1 AS SELECT id FROM jt") + sql("CREATE VIEW v2 AS SELECT * FROM v1") + withTable("jt2", "jt3") { +// Don't change the view schema +val df2 = (1 until 10).map(i => i + i).toDF("id") +df2.write.format("json").saveAsTable("jt2") +sql("ALTER VIEW v1 AS SELECT * FROM jt2") +// the view v2 should have the same output with the view v1 +checkAnswer(sql("SELECT * FROM v2"), sql("SELECT * FROM v1")) +// the view v2 should have the same output with the table jt2 +checkAnswer(sql("SELECT * FROM v2"), sql("SELECT * FROM jt2")) + +// Change the view schema +val df3 = (1 until 10).map(i => i -> i).toDF("i", "j") +df3.write.format("json").saveAsTable("jt3") +sql("ALTER VIEW v1 AS SELECT * FROM jt3") +val e = intercept[AnalysisException] { + sql("SELECT * FROM v2") +} +assert(e.message.contains( + "The underlying schema doesn't match the original schema, expected " + +"STRUCT<`id`: INT> but got STRUCT<`i`: INT, `j`: INT>")) + } +} + } + + test("Using view after drop the origin view") { +withView("v1", "v2") { + sql("CREATE VIEW v1 AS SELECT id FROM jt") + sql("CREATE VIEW v2 AS SELECT * FROM v1") + // Drop the referenced view + sql("DROP VIEW v1") + val e = intercept[RuntimeException] { +sql("SELECT * FROM v2") + } + assert(e.getMessage.contains( +"Failed to analyze the canonicalized SQL")) +} + } + + test("Using view after change the origin table") { +withTable("tab1") { + val df = (1 until 10).map(i => i).toDF("i") + df.write.format("json").saveAsTable("tab1") + withView("v1", "v2") { +sql("CREATE VIEW v1 AS SELECT * FROM tab1") +sql("CREATE VIEW v2 AS SELECT * FROM v1") +// Don't change the table schema +val df2 = (1 until 10).map(i => i * i).toDF("i") +df2.write.format("json").mode("overwrite").saveAsTable("tab1") +// the view v2 should have the same output with the view v1 +checkAnswer(sql("SELECT * FROM v2"), sql("SELECT * FROM v1")) +// the view v2 should have the same output with the table testTable --- End diff -- `testTable` or `tab1`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16173: [SPARK-18742][CORE]readd spark.broadcast.factory conf to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16173 **[Test build #69732 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69732/consoleFull)** for PR 16173 at commit [`a17d325`](https://github.com/apache/spark/commit/a17d3250d4472619fa987d9b389d48b165c44745). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16173: [SPARK-18742][CORE]readd spark.broadcast.factory conf to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16173 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16173: [SPARK-18742][CORE]readd spark.broadcast.factory conf to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16173 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69732/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16173: [SPARK-18742][CORE]readd spark.broadcast.factory conf to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16173 **[Test build #69732 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69732/consoleFull)** for PR 16173 at commit [`a17d325`](https://github.com/apache/spark/commit/a17d3250d4472619fa987d9b389d48b165c44745). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16173: [SPARK-18742][CORE]readd spark.broadcast.factory ...
GitHub user windpiger opened a pull request: https://github.com/apache/spark/pull/16173 [SPARK-18742][CORE]readd spark.broadcast.factory conf to implement er-defined BroadcastFactory ## What changes were proposed in this pull request? After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of BroadcastFactory is TorrentBroadcastFactory. No code in Spark 2 uses BroadcastFactory (but TorrentBroadcastFactory) however the scaladoc says [2]: /** * An interface for all the broadcast implementations in Spark (to allow * multiple broadcast implementations). SparkContext uses a user-specified * BroadcastFactory implementation to instantiate a particular broadcast for the * entire Spark job. */ which is not correct since there is no way to plug in a custom user-specified BroadcastFactory. It is better to readd spark.broadcast.factory for user-defined BroadcastFactory [1] https://issues.apache.org/jira/browse/SPARK-12588 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30 ## How was this patch tested? unit test added You can merge this pull request into a Git repository by running: $ git pull https://github.com/windpiger/spark addBroadFactoryConf Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16173.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16173 commit a17d3250d4472619fa987d9b389d48b165c44745 Author: rootDate: 2016-12-06T16:14:26Z [SPARK-18742][CORE]readd spark.broadcast.factory conf to implement user-defined BroadcastFactory --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user nsyca commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91110663 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala --- @@ -448,19 +476,105 @@ class SQLViewSuite extends QueryTest with SQLTestUtils with TestHiveSingleton { } } + test("Using view after change the origin view") { +withView("v1", "v2") { + sql("CREATE VIEW v1 AS SELECT id FROM jt") + sql("CREATE VIEW v2 AS SELECT * FROM v1") + withTable("jt2", "jt3") { +// Don't change the view schema +val df2 = (1 until 10).map(i => i + i).toDF("id") --- End diff -- I suggest to twist this test case a bit. The first part of the test case has the table jt defined as columns ("j", "i") and then the second part jt2 defined as, what you currently have, columns("i", "j") to test that the code checks the order of the columns correctly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user nsyca commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91108821 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala --- @@ -207,31 +205,56 @@ case class CreateViewCommand( } /** - * Returns a [[CatalogTable]] that can be used to save in the catalog. This comment canonicalize - * SQL based on the analyzed plan, and also creates the proper schema for the view. + * Returns a [[CatalogTable]] that can be used to save in the catalog. This stores the following + * properties for a view: + * 1. The `viewText` which is used to generate a logical plan when we resolve a view; + * 2. The `currentDatabase` which sets the current database on Analyze stage; + * 3. The `schema` which ensure we generate the correct output. */ private def prepareTable(sparkSession: SparkSession, aliasedPlan: LogicalPlan): CatalogTable = { -val viewSQL: String = new SQLBuilder(aliasedPlan).toSQL +val currentDatabase = sparkSession.sessionState.catalog.getCurrentDatabase -// Validate the view SQL - make sure we can parse it and analyze it. -// If we cannot analyze the generated query, there is probably a bug in SQL generation. -try { - sparkSession.sql(viewSQL).queryExecution.assertAnalyzed() -} catch { - case NonFatal(e) => -throw new RuntimeException(s"Failed to analyze the canonicalized SQL: $viewSQL", e) -} +if (originalText.isDefined) { + val viewSQL = originalText.get + + // Validate the view SQL - make sure we can resolve it with currentDatabase. + val originalSchema = try { +val unresolvedPlan = sparkSession.sessionState.sqlParser.parsePlan(viewSQL) +val resolvedPlan = sparkSession.sessionState.analyzer.execute(unresolvedPlan) +sparkSession.sessionState.analyzer.checkAnalysis(resolvedPlan) + +resolvedPlan.schema + } catch { +case NonFatal(e) => + throw new RuntimeException(s"Failed to analyze the canonicalized SQL: $viewSQL", e) + } -CatalogTable( - identifier = name, - tableType = CatalogTableType.VIEW, - storage = CatalogStorageFormat.empty, - schema = aliasedPlan.schema, - properties = properties, - viewOriginalText = originalText, - viewText = Some(viewSQL), - comment = comment -) + CatalogTable( +identifier = name, +tableType = CatalogTableType.VIEW, +storage = CatalogStorageFormat.empty, +schema = aliasedPlan.schema, +originalSchema = Some(originalSchema), +properties = properties, +viewOriginalText = originalText, +viewText = Some(viewSQL), +currentDatabase = Some(currentDatabase), +comment = comment + ) +} else { --- End diff -- I guess the code in the `else` happens when the object is a table, not a view, correct? Please add this to the prologue of the function for clarity. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user nsyca commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r9345 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala --- @@ -448,19 +476,105 @@ class SQLViewSuite extends QueryTest with SQLTestUtils with TestHiveSingleton { } } + test("Using view after change the origin view") { +withView("v1", "v2") { + sql("CREATE VIEW v1 AS SELECT id FROM jt") + sql("CREATE VIEW v2 AS SELECT * FROM v1") + withTable("jt2", "jt3") { +// Don't change the view schema +val df2 = (1 until 10).map(i => i + i).toDF("id") +df2.write.format("json").saveAsTable("jt2") +sql("ALTER VIEW v1 AS SELECT * FROM jt2") +// the view v2 should have the same output with the view v1 +checkAnswer(sql("SELECT * FROM v2"), sql("SELECT * FROM v1")) +// the view v2 should have the same output with the table jt2 +checkAnswer(sql("SELECT * FROM v2"), sql("SELECT * FROM jt2")) + +// Change the view schema +val df3 = (1 until 10).map(i => i -> i).toDF("i", "j") +df3.write.format("json").saveAsTable("jt3") +sql("ALTER VIEW v1 AS SELECT * FROM jt3") +val e = intercept[AnalysisException] { + sql("SELECT * FROM v2") +} +assert(e.message.contains( + "The underlying schema doesn't match the original schema, expected " + +"STRUCT<`id`: INT> but got STRUCT<`i`: INT, `j`: INT>")) + } +} + } + + test("Using view after drop the origin view") { +withView("v1", "v2") { + sql("CREATE VIEW v1 AS SELECT id FROM jt") + sql("CREATE VIEW v2 AS SELECT * FROM v1") + // Drop the referenced view + sql("DROP VIEW v1") + val e = intercept[RuntimeException] { +sql("SELECT * FROM v2") + } + assert(e.getMessage.contains( +"Failed to analyze the canonicalized SQL")) +} + } --- End diff -- I suggest adding a variation of this test case, dropping the underlining table "jt" and v2 should be invalid. This will test the multiple-level expansion of a view. It is also no harm to check that v1 is also invalid in this new test case. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user nsyca commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91106867 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -151,6 +151,7 @@ case class CatalogTable( tableType: CatalogTableType, storage: CatalogStorageFormat, schema: StructType, +originalSchema: Option[StructType] = None, --- End diff -- I guess this is to pass 1) the persistent schema of the view at the last call to `(Simple)Analyzer`, and 2) the schema of the view from calling `Analyzer` on the definition of the view definition captured in SQL string input from the user. If this is correct, we need better variable names (and a few comments will greatly help.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user nsyca commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91112094 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala --- @@ -448,19 +476,105 @@ class SQLViewSuite extends QueryTest with SQLTestUtils with TestHiveSingleton { } } + test("Using view after change the origin view") { +withView("v1", "v2") { + sql("CREATE VIEW v1 AS SELECT id FROM jt") + sql("CREATE VIEW v2 AS SELECT * FROM v1") + withTable("jt2", "jt3") { +// Don't change the view schema +val df2 = (1 until 10).map(i => i + i).toDF("id") +df2.write.format("json").saveAsTable("jt2") +sql("ALTER VIEW v1 AS SELECT * FROM jt2") +// the view v2 should have the same output with the view v1 +checkAnswer(sql("SELECT * FROM v2"), sql("SELECT * FROM v1")) +// the view v2 should have the same output with the table jt2 +checkAnswer(sql("SELECT * FROM v2"), sql("SELECT * FROM jt2")) + +// Change the view schema +val df3 = (1 until 10).map(i => i -> i).toDF("i", "j") +df3.write.format("json").saveAsTable("jt3") +sql("ALTER VIEW v1 AS SELECT * FROM jt3") +val e = intercept[AnalysisException] { + sql("SELECT * FROM v2") +} +assert(e.message.contains( + "The underlying schema doesn't match the original schema, expected " + +"STRUCT<`id`: INT> but got STRUCT<`i`: INT, `j`: INT>")) + } +} + } + + test("Using view after drop the origin view") { +withView("v1", "v2") { + sql("CREATE VIEW v1 AS SELECT id FROM jt") + sql("CREATE VIEW v2 AS SELECT * FROM v1") + // Drop the referenced view + sql("DROP VIEW v1") + val e = intercept[RuntimeException] { +sql("SELECT * FROM v2") + } + assert(e.getMessage.contains( +"Failed to analyze the canonicalized SQL")) +} + } + + test("Using view after change the origin table") { +withTable("tab1") { + val df = (1 until 10).map(i => i).toDF("i") + df.write.format("json").saveAsTable("tab1") + withView("v1", "v2") { +sql("CREATE VIEW v1 AS SELECT * FROM tab1") +sql("CREATE VIEW v2 AS SELECT * FROM v1") +// Don't change the table schema +val df2 = (1 until 10).map(i => i * i).toDF("i") +df2.write.format("json").mode("overwrite").saveAsTable("tab1") +// the view v2 should have the same output with the view v1 +checkAnswer(sql("SELECT * FROM v2"), sql("SELECT * FROM v1")) +// the view v2 should have the same output with the table testTable +checkAnswer(sql("SELECT * FROM v2"), sql("SELECT * FROM tab1")) + +// Change the table schema +val df3 = (1 until 10).map(i => i -> i).toDF("a", "b") +df3.write.format("json").mode("overwrite").saveAsTable("tab1") +val e = intercept[RuntimeException] { + sql("SELECT * FROM v2") +} +assert(e.getMessage.contains( + "Failed to analyze the canonicalized SQL")) + } +} + } + + test("Using view after drop the origin table") { +withTable("tab1") { + val df = (1 until 10).map(i => i).toDF("i") + df.write.format("json").saveAsTable("tab1") + withView("v1") { +sql("CREATE VIEW v1 AS SELECT * FROM tab1") +// Drop the referenced table +sql("DROP TABLE tab1") +val e = intercept[RuntimeException] { + sql("SELECT * FROM v1") +} +assert(e.getMessage.contains( + "Failed to analyze the canonicalized SQL")) + } +} + } + --- End diff -- If you add a new test case to test the multiple-level expansion as I suggested before, this test case is just a duplicate. It may be removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16006: [SPARK-18580] [DStreams] [external/kafka-0-10] Use spark...
Github user omuravskiy commented on the issue: https://github.com/apache/spark/pull/16006 I added new commit, the default value of `backpressureInitialRate` if `spark.streaming.backpressure.initialRate` is now taken from `spark.streaming.backpressure.pid.minRate`, or 100 if that is also not set (which is the same as in `org.apache.spark.streaming.scheduler.rate.RateEstimator`). And `backpressureInitialRate`, when applied, is now shared proportionally among all partitions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16168: [SPARK-18209][SQL] More robust view canonicalization wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16168 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16168: [SPARK-18209][SQL] More robust view canonicalization wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16168 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69730/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user nsyca commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91104224 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -221,11 +223,14 @@ case class CatalogTable( s"Last Access: ${new Date(lastAccessTime).toString}", s"Type: ${tableType.name}", if (schema.nonEmpty) s"Schema: ${schema.mkString("[", ", ", "]")}" else "", +if (originalSchema.isDefined) s"Original Schema: ${originalSchema.mkString("[", ", ", "]")}" +else "", if (provider.isDefined) s"Provider: ${provider.get}" else "", if (partitionColumnNames.nonEmpty) s"Partition Columns: $partitionColumns" else "" ) ++ bucketStrings ++ Seq( viewOriginalText.map("Original View: " + _).getOrElse(""), viewText.map("View: " + _).getOrElse(""), +currentDatabase.map("Database Hint: " + _).getOrElse(""), --- End diff -- Same comment. This is an optional database name. Should we change the string to `"Database name: "`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16168: [SPARK-18209][SQL] More robust view canonicalization wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16168 **[Test build #69730 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69730/consoleFull)** for PR 16168 at commit [`434009e`](https://github.com/apache/spark/commit/434009ef7fcfc552314dca35dc9395bb98f5e9c6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user nsyca commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91102867 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -557,9 +557,12 @@ class SessionCatalog( * If the relation is a view, the relation will be wrapped in a [[SubqueryAlias]] which will * track the name of the view. */ - def lookupRelation(name: TableIdentifier, alias: Option[String] = None): LogicalPlan = { + def lookupRelation( + name: TableIdentifier, + alias: Option[String] = None, + databaseHint: Option[String] = None): LogicalPlan = { --- End diff -- I don't think the variable `databaseHint` is properly named. It is simply an optional database name passed into this function. Should it be named `databaseName`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15722: [SPARK-18208] [Shuffle] Executor OOM due to a growing Lo...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/15722 @jiexiong PR descriptions are used in git commit messages, and should be clear and concise. The fix LGTM, but the description should be improved for future reference. How about we change it into the following (I only redacted the comments in the PR): ## What changes were proposed in this pull request? `BytesToBytesMap` currently does not release the in-memory storage (the `longArray` variable) after it spills to disk. This is typically not a problem during aggregation because the longArray should be much smaller than the pages, and because we grow the `longArray` at a conservative rate. However this can lead to an OOM when an already running task is allocated more than its fair share, this can happen because of a scheduling delay. In this case the `longArray` can grow beyond the fair share of memory for the task. This becomes problematic when the task spills and the long array is not freed, that causes subsequent memory allocation requests to be denied by the memory manager resulting in an OOM. This PR fixes this issuing by freeing the `longArray` when the `BytesToBytesMap` spills. ## How was this patch tested? Existing tests and tested on realworld workloads. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16172: [SPARK-18740] Log spark.app.name in driver logs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16172 **[Test build #69731 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69731/consoleFull)** for PR 16172 at commit [`5401e64`](https://github.com/apache/spark/commit/5401e64552325353104bc7addf180a5a37f35047). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16172: [SPARK-18740] Log spark.app.name in driver logs
GitHub user peterableda opened a pull request: https://github.com/apache/spark/pull/16172 [SPARK-18740] Log spark.app.name in driver logs ## What changes were proposed in this pull request? Added simple logInfo line to print out the `spark.app.name` in the driver logs ## How was this patch tested? Spark was built and tested with SparkPi app. Example log: ``` 16/12/06 05:49:50 INFO spark.SparkContext: Running Spark version 2.0.0 16/12/06 05:49:52 INFO spark.SparkContext: Submitted application: Spark Pi 16/12/06 05:49:52 INFO spark.SecurityManager: Changing view acls to: root 16/12/06 05:49:52 INFO spark.SecurityManager: Changing modify acls to: root ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/peterableda/spark feature/print_appname Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16172.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16172 commit 5401e64552325353104bc7addf180a5a37f35047 Author: Peter AbledaDate: 2016-12-06T14:18:03Z log out spark.app.name in the Spark driver logs --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16169: [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API ...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/16169 cc @felixcheung @jkbradley --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13909 @cloud-fan thank you for your thoughtful review comments. The latest one looks better. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13909 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69728/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13909 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13909 **[Test build #69728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69728/consoleFull)** for PR 13909 at commit [`185ddf5`](https://github.com/apache/spark/commit/185ddf58b9d2e30c76589de4f566911e7855ea79). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...
Github user a-roberts commented on the issue: https://github.com/apache/spark/pull/15736 New data for us, inlined comparator scores here (code provided below to check I've not profiled something useless!): ``` ScalaSparkPagerank 2016-12-05 13:44:41 25992811548.149 5398411 5398411 ScalaSparkPagerank 2016-12-05 13:46:43 25992811546.897 5542531 5542531 ScalaSparkPagerank 2016-12-05 13:48:46 25992811549.130 5290619 5290619 ScalaSparkPagerank 2016-12-05 13:50:49 25992811549.793 5220173 5220173 ScalaSparkPagerank 2016-12-05 13:52:50 25992811548.061 5408296 5408296 ScalaSparkPagerank 2016-12-05 13:54:52 25992811546.468 5593701 5593701 ScalaSparkPagerank 2016-12-05 13:56:56 25992811551.385 5058443 5058443 ScalaSparkPagerank 2016-12-05 13:58:59 25992811547.857 5431349 5431349 ScalaSparkPagerank 2016-12-05 14:00:59 25992811546.515 5588049 5588049 ScalaSparkPagerank 2016-12-05 14:03:03 25992811547.791 5438850 5438850 Avg 48.2046s ``` Remember our "vanilla" average time is 47.752s and our first commit averaged 47.229s (so not much of a difference really). I think we're splitting hairs and I've got another PR I am seeing good results on that I plan to focus on instead: the SizeEstimator. This is what I've benchmarked, PartitionedAppendOnlyMap first, so let me know if there any further suggestions, otherwise I propose leaving this one for later as actually against the Spark master codebase I'm not noticing anything exciting. ``` def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]]) : Iterator[((Int, K), V)] = { val comparator = { if (keyComparator.isDefined) { val theKeyComp = keyComparator.get new Comparator[(Int, K)] { // We know we have a non-empty comparator here override def compare(a: (Int, K), b: (Int, K)): Int = { if (a._1 != b._1) { a._1 - b._1 } else { theKeyComp.compare(a._2, b._2) } } } } else { new Comparator[(Int, K)] { override def compare(a: (Int, K), b: (Int, K)): Int = { a._1 - b._1 } } } } destructiveSortedIterator(comparator) } ``` In PartitionedPairBuffer ``` /** Iterate through the data in a given order. For this class this is not really destructive. */ override def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]]) : Iterator[((Int, K), V)] = { val comparator = { if (keyComparator.isDefined) { val theKeyComp = keyComparator.get new Comparator[(Int, K)] { // We know we have a non-empty comparator here override def compare(a: (Int, K), b: (Int, K)): Int = { if (a._1 != b._1) { a._1 - b._1 } else { theKeyComp.compare(a._2, b._2) } } } } else { new Comparator[(Int, K)] { override def compare(a: (Int, K), b: (Int, K)): Int = { a._1 - b._1 } } } } new Sorter(new KVArraySortDataFormat[(Int, K), AnyRef]).sort(data, 0, curSize, comparator) iterator } ``` WritablePartitionedPairCollection remains unchanged. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16170: [SPARK-18634][SQL][TRIVIAL] Touch-up Generate
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16170 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16170: [SPARK-18634][SQL][TRIVIAL] Touch-up Generate
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16170 @hvanhovell Thanks for helping fixing this I missed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16170: [SPARK-18634][SQL][TRIVIAL] Touch-up Generate
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/16170 Thanks for the review. Merging this one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16170: [SPARK-18634][SQL][TRIVIAL] Touch-up Generate
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16170 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91054484 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -114,9 +117,26 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log alias.map(a => SubqueryAlias(a, qualifiedTable, None)).getOrElse(qualifiedTable) } else if (table.tableType == CatalogTableType.VIEW) { val viewText = table.viewText.getOrElse(sys.error("Invalid view without text.")) + val unresolvedPlan = sparkSession.sessionState.sqlParser.parsePlan(viewText).transform { +case u: UnresolvedRelation if u.tableIdentifier.database.isEmpty => + u.copy(tableIdentifier = TableIdentifier(u.tableIdentifier.table, table.currentDatabase)) + } + // Resolve the plan and check whether the analyzed plan is valid. + val resolvedPlan = try { --- End diff -- Duplicate of what is in `views.scala`, put this in a common object. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91051441 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala --- @@ -207,31 +205,56 @@ case class CreateViewCommand( } /** - * Returns a [[CatalogTable]] that can be used to save in the catalog. This comment canonicalize - * SQL based on the analyzed plan, and also creates the proper schema for the view. + * Returns a [[CatalogTable]] that can be used to save in the catalog. This stores the following + * properties for a view: + * 1. The `viewText` which is used to generate a logical plan when we resolve a view; + * 2. The `currentDatabase` which sets the current database on Analyze stage; + * 3. The `schema` which ensure we generate the correct output. */ private def prepareTable(sparkSession: SparkSession, aliasedPlan: LogicalPlan): CatalogTable = { -val viewSQL: String = new SQLBuilder(aliasedPlan).toSQL +val currentDatabase = sparkSession.sessionState.catalog.getCurrentDatabase -// Validate the view SQL - make sure we can parse it and analyze it. -// If we cannot analyze the generated query, there is probably a bug in SQL generation. -try { - sparkSession.sql(viewSQL).queryExecution.assertAnalyzed() -} catch { - case NonFatal(e) => -throw new RuntimeException(s"Failed to analyze the canonicalized SQL: $viewSQL", e) -} +if (originalText.isDefined) { --- End diff -- When does this happen? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91053965 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -509,32 +509,42 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { +private def lookupTableFromCatalog( +u: UnresolvedRelation, db: Option[String] = None): LogicalPlan = { try { -catalog.lookupRelation(u.tableIdentifier, u.alias) +catalog.lookupRelation(u.tableIdentifier, u.alias, db) } catch { case _: NoSuchTableException => u.failAnalysis(s"Table or view not found: ${u.tableName}") } } -def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { - case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => -i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) - case u: UnresolvedRelation => -val table = u.tableIdentifier -if (table.database.isDefined && conf.runSQLonFile && !catalog.isTemporaryTable(table) && +def apply(plan: LogicalPlan): LogicalPlan = { + var currentDatabase = catalog.getCurrentDatabase + plan resolveOperators { +case i@InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => + i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) +case u: UnresolvedRelation => + val table = u.tableIdentifier + if (table.database.isDefined && conf.runSQLonFile && !catalog.isTemporaryTable(table) && (!catalog.databaseExists(table.database.get) || !catalog.tableExists(table))) { - // If the database part is specified, and we support running SQL directly on files, and - // it's not a temporary view, and the table does not exist, then let's just return the - // original UnresolvedRelation. It is possible we are matching a query like "select * - // from parquet.`/path/to/query`". The plan will get resolved later. - // Note that we are testing (!db_exists || !table_exists) because the catalog throws - // an exception from tableExists if the database does not exist. - u -} else { - lookupTableFromCatalog(u) -} +// If the database part is specified, and we support running SQL directly on files, and +// it's not a temporary view, and the table does not exist, then let's just return the +// original UnresolvedRelation. It is possible we are matching a query like "select * +// from parquet.`/path/to/query`". The plan will get resolved later. +// Note that we are testing (!db_exists || !table_exists) because the catalog throws +// an exception from tableExists if the database does not exist. +u + } else { +val logicalPlan = lookupTableFromCatalog(u, Some(currentDatabase)) +currentDatabase = logicalPlan.collectFirst { --- End diff -- BTW we could also choose to move most of the logic here into the `SessionCatalog`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91053897 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -509,32 +509,42 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { +private def lookupTableFromCatalog( +u: UnresolvedRelation, db: Option[String] = None): LogicalPlan = { try { -catalog.lookupRelation(u.tableIdentifier, u.alias) +catalog.lookupRelation(u.tableIdentifier, u.alias, db) } catch { case _: NoSuchTableException => u.failAnalysis(s"Table or view not found: ${u.tableName}") } } -def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { - case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => -i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) - case u: UnresolvedRelation => -val table = u.tableIdentifier -if (table.database.isDefined && conf.runSQLonFile && !catalog.isTemporaryTable(table) && +def apply(plan: LogicalPlan): LogicalPlan = { + var currentDatabase = catalog.getCurrentDatabase + plan resolveOperators { +case i@InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => + i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) +case u: UnresolvedRelation => + val table = u.tableIdentifier + if (table.database.isDefined && conf.runSQLonFile && !catalog.isTemporaryTable(table) && (!catalog.databaseExists(table.database.get) || !catalog.tableExists(table))) { - // If the database part is specified, and we support running SQL directly on files, and - // it's not a temporary view, and the table does not exist, then let's just return the - // original UnresolvedRelation. It is possible we are matching a query like "select * - // from parquet.`/path/to/query`". The plan will get resolved later. - // Note that we are testing (!db_exists || !table_exists) because the catalog throws - // an exception from tableExists if the database does not exist. - u -} else { - lookupTableFromCatalog(u) -} +// If the database part is specified, and we support running SQL directly on files, and +// it's not a temporary view, and the table does not exist, then let's just return the +// original UnresolvedRelation. It is possible we are matching a query like "select * +// from parquet.`/path/to/query`". The plan will get resolved later. +// Note that we are testing (!db_exists || !table_exists) because the catalog throws +// an exception from tableExists if the database does not exist. +u + } else { +val logicalPlan = lookupTableFromCatalog(u, Some(currentDatabase)) +currentDatabase = logicalPlan.collectFirst { --- End diff -- This looks scary; for instance it depends on the iteration order for `collectFirst`. Why is this needed and why can't we use the current database? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91053671 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -509,32 +509,42 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { +private def lookupTableFromCatalog( +u: UnresolvedRelation, db: Option[String] = None): LogicalPlan = { try { -catalog.lookupRelation(u.tableIdentifier, u.alias) +catalog.lookupRelation(u.tableIdentifier, u.alias, db) } catch { case _: NoSuchTableException => u.failAnalysis(s"Table or view not found: ${u.tableName}") } } -def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { - case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => -i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) - case u: UnresolvedRelation => -val table = u.tableIdentifier -if (table.database.isDefined && conf.runSQLonFile && !catalog.isTemporaryTable(table) && +def apply(plan: LogicalPlan): LogicalPlan = { + var currentDatabase = catalog.getCurrentDatabase + plan resolveOperators { +case i@InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => + i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) +case u: UnresolvedRelation => + val table = u.tableIdentifier + if (table.database.isDefined && conf.runSQLonFile && !catalog.isTemporaryTable(table) && --- End diff -- Why not put all this into a helper function and put the if statement into `case ... if ...`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91078953 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -151,6 +151,7 @@ case class CatalogTable( tableType: CatalogTableType, storage: CatalogStorageFormat, schema: StructType, +originalSchema: Option[StructType] = None, --- End diff -- The only difference between originalSchema and schema are the column names, right? If that is the case why not store the `source` column name in the schema field metadata? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91053117 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -151,6 +151,7 @@ case class CatalogTable( tableType: CatalogTableType, storage: CatalogStorageFormat, schema: StructType, +originalSchema: Option[StructType] = None, --- End diff -- It is nice to add some documentation for this. What is its purpose? The same goes for currentDatabase. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16168: [SPARK-18209][SQL] More robust view canonicalization wit...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/16168 also cc @gatorsmile --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91064615 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -126,6 +146,55 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log } } + /** + * Apply Projection on unresolved logical plan to: + * 1. Omit the columns which are not referenced by the view; + * 2. Reorder the columns to keep the same order with the view; + */ + private def withProjection(plan: LogicalPlan, schema: StructType): LogicalPlan = { +// All fields in schema should exist in plan.schema, or we should throw an AnalysisException +// to notify the underlying schema has been changed. +if (schema.fields.forall { field => + plan.schema.fields.exists(other => compareStructField(field, other))}) { + val output = schema.fields.map { field => +plan.output.find { expr => + expr.name == field.name && expr.dataType == field.dataType}.getOrElse( --- End diff -- This is a case-sensitive match; spark's default is case-insensitive matching. Please use a resolver for such a check. A more general thought is that we really should leverage the analyzer's capabilities here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91051621 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala --- @@ -207,31 +205,56 @@ case class CreateViewCommand( } /** - * Returns a [[CatalogTable]] that can be used to save in the catalog. This comment canonicalize - * SQL based on the analyzed plan, and also creates the proper schema for the view. + * Returns a [[CatalogTable]] that can be used to save in the catalog. This stores the following + * properties for a view: + * 1. The `viewText` which is used to generate a logical plan when we resolve a view; + * 2. The `currentDatabase` which sets the current database on Analyze stage; + * 3. The `schema` which ensure we generate the correct output. */ private def prepareTable(sparkSession: SparkSession, aliasedPlan: LogicalPlan): CatalogTable = { -val viewSQL: String = new SQLBuilder(aliasedPlan).toSQL +val currentDatabase = sparkSession.sessionState.catalog.getCurrentDatabase -// Validate the view SQL - make sure we can parse it and analyze it. -// If we cannot analyze the generated query, there is probably a bug in SQL generation. -try { - sparkSession.sql(viewSQL).queryExecution.assertAnalyzed() -} catch { - case NonFatal(e) => -throw new RuntimeException(s"Failed to analyze the canonicalized SQL: $viewSQL", e) -} +if (originalText.isDefined) { + val viewSQL = originalText.get + + // Validate the view SQL - make sure we can resolve it with currentDatabase. + val originalSchema = try { +val unresolvedPlan = sparkSession.sessionState.sqlParser.parsePlan(viewSQL) +val resolvedPlan = sparkSession.sessionState.analyzer.execute(unresolvedPlan) +sparkSession.sessionState.analyzer.checkAnalysis(resolvedPlan) + +resolvedPlan.schema + } catch { +case NonFatal(e) => + throw new RuntimeException(s"Failed to analyze the canonicalized SQL: $viewSQL", e) --- End diff -- Why do you throw a RuntimeException? Just throw an AnalysisException. The message is also outdated. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91051761 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala --- @@ -275,20 +297,26 @@ case class AlterViewAsCommand( throw new AnalysisException(s"${viewMeta.identifier} is not a view.") } -val viewSQL: String = new SQLBuilder(analyzedPlan).toSQL -// Validate the view SQL - make sure we can parse it and analyze it. -// If we cannot analyze the generated query, there is probably a bug in SQL generation. -try { - session.sql(viewSQL).queryExecution.assertAnalyzed() +val currentDatabase = session.sessionState.catalog.getCurrentDatabase + +// Validate the view SQL - make sure we can resolve it with currentDatabase. +val originalSchema = try { --- End diff -- This is duplicate - should we put this into a common trait? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91065113 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -126,6 +146,55 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log } } + /** + * Apply Projection on unresolved logical plan to: + * 1. Omit the columns which are not referenced by the view; + * 2. Reorder the columns to keep the same order with the view; + */ + private def withProjection(plan: LogicalPlan, schema: StructType): LogicalPlan = { +// All fields in schema should exist in plan.schema, or we should throw an AnalysisException +// to notify the underlying schema has been changed. +if (schema.fields.forall { field => + plan.schema.fields.exists(other => compareStructField(field, other))}) { + val output = schema.fields.map { field => +plan.output.find { expr => + expr.name == field.name && expr.dataType == field.dataType}.getOrElse( +throw new AnalysisException("The underlying schema doesn't match the original " + + s"schema, expected ${schema.sql} but got ${plan.schema.sql}") + )} + Project(output, plan) +} else { + throw new AnalysisException("The underlying schema doesn't match the original schema, " + +s"expected ${schema.sql} but got ${plan.schema.sql}") +} + } + + /** + * Compare the both [[StructField]] to verify whether they have the same name and dataType. + */ + private def compareStructField(field: StructField, other: StructField): Boolean = { +field.name == other.name && field.dataType == other.dataType --- End diff -- case-sensitive checking, see my previous comment. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91078407 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -114,9 +117,26 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log alias.map(a => SubqueryAlias(a, qualifiedTable, None)).getOrElse(qualifiedTable) } else if (table.tableType == CatalogTableType.VIEW) { val viewText = table.viewText.getOrElse(sys.error("Invalid view without text.")) + val unresolvedPlan = sparkSession.sessionState.sqlParser.parsePlan(viewText).transform { --- End diff -- This will probably break with common table expressions. We could just set the current database, have the analyzer solve it, and set the current database back to its old value. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91066027 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -126,6 +146,55 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log } } + /** + * Apply Projection on unresolved logical plan to: + * 1. Omit the columns which are not referenced by the view; + * 2. Reorder the columns to keep the same order with the view; + */ + private def withProjection(plan: LogicalPlan, schema: StructType): LogicalPlan = { +// All fields in schema should exist in plan.schema, or we should throw an AnalysisException +// to notify the underlying schema has been changed. +if (schema.fields.forall { field => + plan.schema.fields.exists(other => compareStructField(field, other))}) { + val output = schema.fields.map { field => +plan.output.find { expr => + expr.name == field.name && expr.dataType == field.dataType}.getOrElse( +throw new AnalysisException("The underlying schema doesn't match the original " + + s"schema, expected ${schema.sql} but got ${plan.schema.sql}") + )} + Project(output, plan) +} else { + throw new AnalysisException("The underlying schema doesn't match the original schema, " + +s"expected ${schema.sql} but got ${plan.schema.sql}") +} + } + + /** + * Compare the both [[StructField]] to verify whether they have the same name and dataType. + */ + private def compareStructField(field: StructField, other: StructField): Boolean = { +field.name == other.name && field.dataType == other.dataType + } + + /** + * Aliases the schema of the LogicalPlan to the view attribute names + */ + private def aliasColumns(plan: LogicalPlan, fields: Seq[StructField]): LogicalPlan = { +val output = fields.map(field => (field.name, field.getComment)) +if (plan.output.size != output.size) { --- End diff -- Is this even possible? This should have been caught in `withProjection. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91052968 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala --- @@ -207,31 +205,56 @@ case class CreateViewCommand( } /** - * Returns a [[CatalogTable]] that can be used to save in the catalog. This comment canonicalize - * SQL based on the analyzed plan, and also creates the proper schema for the view. + * Returns a [[CatalogTable]] that can be used to save in the catalog. This stores the following + * properties for a view: + * 1. The `viewText` which is used to generate a logical plan when we resolve a view; + * 2. The `currentDatabase` which sets the current database on Analyze stage; + * 3. The `schema` which ensure we generate the correct output. */ private def prepareTable(sparkSession: SparkSession, aliasedPlan: LogicalPlan): CatalogTable = { -val viewSQL: String = new SQLBuilder(aliasedPlan).toSQL +val currentDatabase = sparkSession.sessionState.catalog.getCurrentDatabase -// Validate the view SQL - make sure we can parse it and analyze it. -// If we cannot analyze the generated query, there is probably a bug in SQL generation. -try { - sparkSession.sql(viewSQL).queryExecution.assertAnalyzed() -} catch { - case NonFatal(e) => -throw new RuntimeException(s"Failed to analyze the canonicalized SQL: $viewSQL", e) -} +if (originalText.isDefined) { + val viewSQL = originalText.get + + // Validate the view SQL - make sure we can resolve it with currentDatabase. + val originalSchema = try { +val unresolvedPlan = sparkSession.sessionState.sqlParser.parsePlan(viewSQL) +val resolvedPlan = sparkSession.sessionState.analyzer.execute(unresolvedPlan) +sparkSession.sessionState.analyzer.checkAnalysis(resolvedPlan) + +resolvedPlan.schema + } catch { +case NonFatal(e) => + throw new RuntimeException(s"Failed to analyze the canonicalized SQL: $viewSQL", e) --- End diff -- I think we are expecting an AnalysisException here, why not catch that instead? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91051675 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala --- @@ -275,20 +297,26 @@ case class AlterViewAsCommand( throw new AnalysisException(s"${viewMeta.identifier} is not a view.") } -val viewSQL: String = new SQLBuilder(analyzedPlan).toSQL -// Validate the view SQL - make sure we can parse it and analyze it. -// If we cannot analyze the generated query, there is probably a bug in SQL generation. -try { - session.sql(viewSQL).queryExecution.assertAnalyzed() +val currentDatabase = session.sessionState.catalog.getCurrentDatabase + +// Validate the view SQL - make sure we can resolve it with currentDatabase. +val originalSchema = try { + val unresolvedPlan = session.sessionState.sqlParser.parsePlan(originalText) + val resolvedPlan = session.sessionState.analyzer.execute(unresolvedPlan) + session.sessionState.analyzer.checkAnalysis(resolvedPlan) + + resolvedPlan.schema } catch { case NonFatal(e) => -throw new RuntimeException(s"Failed to analyze the canonicalized SQL: $viewSQL", e) +throw new RuntimeException(s"Failed to analyze the canonicalized SQL: $originalText", e) --- End diff -- Same a before --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91064795 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -114,9 +117,26 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log alias.map(a => SubqueryAlias(a, qualifiedTable, None)).getOrElse(qualifiedTable) } else if (table.tableType == CatalogTableType.VIEW) { val viewText = table.viewText.getOrElse(sys.error("Invalid view without text.")) + val unresolvedPlan = sparkSession.sessionState.sqlParser.parsePlan(viewText).transform { +case u: UnresolvedRelation if u.tableIdentifier.database.isEmpty => + u.copy(tableIdentifier = TableIdentifier(u.tableIdentifier.table, table.currentDatabase)) + } + // Resolve the plan and check whether the analyzed plan is valid. + val resolvedPlan = try { +val resolvedPlan = sparkSession.sessionState.analyzer.execute(unresolvedPlan) +sparkSession.sessionState.analyzer.checkAnalysis(resolvedPlan) + +resolvedPlan + } catch { +case NonFatal(e) => + throw new RuntimeException(s"Failed to analyze the canonicalized SQL: $viewText", e) + } + val planWithProjection = table.originalSchema.map(withProjection(resolvedPlan, _)) --- End diff -- So this is correct, but it is really hard to grasp. Could you use an if ... else .. here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91064962 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -114,9 +117,26 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log alias.map(a => SubqueryAlias(a, qualifiedTable, None)).getOrElse(qualifiedTable) } else if (table.tableType == CatalogTableType.VIEW) { val viewText = table.viewText.getOrElse(sys.error("Invalid view without text.")) + val unresolvedPlan = sparkSession.sessionState.sqlParser.parsePlan(viewText).transform { +case u: UnresolvedRelation if u.tableIdentifier.database.isEmpty => + u.copy(tableIdentifier = TableIdentifier(u.tableIdentifier.table, table.currentDatabase)) + } + // Resolve the plan and check whether the analyzed plan is valid. + val resolvedPlan = try { +val resolvedPlan = sparkSession.sessionState.analyzer.execute(unresolvedPlan) +sparkSession.sessionState.analyzer.checkAnalysis(resolvedPlan) + +resolvedPlan + } catch { +case NonFatal(e) => + throw new RuntimeException(s"Failed to analyze the canonicalized SQL: $viewText", e) + } + val planWithProjection = table.originalSchema.map(withProjection(resolvedPlan, _)) --- End diff -- This is only for here for legacy reasons right? So a view using the new code path always has an originalSchema defined? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91065717 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -114,9 +117,26 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log alias.map(a => SubqueryAlias(a, qualifiedTable, None)).getOrElse(qualifiedTable) } else if (table.tableType == CatalogTableType.VIEW) { val viewText = table.viewText.getOrElse(sys.error("Invalid view without text.")) + val unresolvedPlan = sparkSession.sessionState.sqlParser.parsePlan(viewText).transform { +case u: UnresolvedRelation if u.tableIdentifier.database.isEmpty => + u.copy(tableIdentifier = TableIdentifier(u.tableIdentifier.table, table.currentDatabase)) + } + // Resolve the plan and check whether the analyzed plan is valid. + val resolvedPlan = try { +val resolvedPlan = sparkSession.sessionState.analyzer.execute(unresolvedPlan) +sparkSession.sessionState.analyzer.checkAnalysis(resolvedPlan) + +resolvedPlan + } catch { +case NonFatal(e) => + throw new RuntimeException(s"Failed to analyze the canonicalized SQL: $viewText", e) + } + val planWithProjection = table.originalSchema.map(withProjection(resolvedPlan, _)) +.getOrElse(resolvedPlan) + SubqueryAlias( alias.getOrElse(table.identifier.table), -sparkSession.sessionState.sqlParser.parsePlan(viewText), +aliasColumns(planWithProjection, table.schema.fields), --- End diff -- Could we try to do one projection instead of two? And could we also avoid a projection if we don't need it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16168#discussion_r91064259 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -126,6 +146,55 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log } } + /** + * Apply Projection on unresolved logical plan to: + * 1. Omit the columns which are not referenced by the view; + * 2. Reorder the columns to keep the same order with the view; + */ + private def withProjection(plan: LogicalPlan, schema: StructType): LogicalPlan = { +// All fields in schema should exist in plan.schema, or we should throw an AnalysisException +// to notify the underlying schema has been changed. +if (schema.fields.forall { field => --- End diff -- It took me while to figure out what was going on here. Please restructure this. You could just do the check in a different code block since you are throwing an exception. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16170: [SPARK-18634][SQL][TRIVIAL] Touch-up Generate
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16170 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69726/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16170: [SPARK-18634][SQL][TRIVIAL] Touch-up Generate
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16170 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16170: [SPARK-18634][SQL][TRIVIAL] Touch-up Generate
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16170 **[Test build #69726 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69726/consoleFull)** for PR 16170 at commit [`8bf57e0`](https://github.com/apache/spark/commit/8bf57e04cf50790f8788791353c35a3a847712eb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org