[GitHub] spark issue #16547: [SPARK-19168][Structured Streaming] Improvement: filter ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16547 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16547: [SPARK-19168][Structured Streaming] Improvement: filter ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71193/ Test FAILed.
[GitHub] spark issue #16547: [SPARK-19168][Structured Streaming] Improvement: filter ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16547 **[Test build #71193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71193/testReport)** for PR 16547 at commit [`c9f62c1`](https://github.com/apache/spark/commit/c9f62c161ad94af35ad237673c85428ce6094ac5). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16547: [SPARK-19168][Structured Streaming] Improvement: filter ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16547 **[Test build #71193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71193/testReport)** for PR 16547 at commit [`c9f62c1`](https://github.com/apache/spark/commit/c9f62c161ad94af35ad237673c85428ce6094ac5).
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95526238

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
@@ -378,6 +379,35 @@ case class InsertIntoTable( }

+/**
+ * A container for holding the view description (CatalogTable) and the output of the view. The
+ * child should be a logical plan parsed from the `CatalogTable.viewText`; an error should be
+ * thrown if the `viewText` is not defined.
+ * This operator will be removed at the end of the analysis stage.
+ *
+ * @param desc A view description (CatalogTable) that provides the necessary information to
+ *             resolve the view.
+ * @param output The output of a view operator; this is generated during planning the view, so
+ *               that we are able to decouple the output from the underlying structure.
+ * @param child The logical plan of a view operator; it should be a logical plan parsed from the
+ *              `CatalogTable.viewText`, and an error should be thrown if the `viewText` is not
+ *              defined.
+ */
+case class View(
+    desc: CatalogTable,
+    output: Seq[Attribute],
+    child: LogicalPlan) extends LogicalPlan with MultiInstanceRelation {
--- End diff --

We only extend `MultiInstanceRelation` for the leaf node. Any reason why it is needed?
[GitHub] spark pull request #16547: [SPARK-19168][Structured Streaming] Improvement: ...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/16547 [SPARK-19168][Structured Streaming] Improvement: filter late data using watermark for `Append` mode

## What changes were proposed in this pull request?

Currently we filter late data using the watermark only in `Update` mode; we should do the same for `Append` mode. Note this is an improvement rather than a correctness fix, because the current behavior of `Append` mode is correct even without it.

## How was this patch tested?

Commit #1 of this patch added `numRowsUpdated` checks in `EventTimeWatermarkSuite.scala`:

```scala
AddData(inputData, 10),
CheckLastBatch(),
assertNumStateRows(2, 1) // We also processed the data 10, which is less than watermark
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lw-lin/spark append-filter

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16547.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16547

commit 86d088687b7da7f33022ee8693cc3fdd9228775b
Author: Liwei Lin
Date: 2017-01-11T07:15:43Z
Also examine `numRowsUpdated` in test

commit 2a91e6f8612c01b61a4d501b22fbf2690fa36f4a
Author: Liwei Lin
Date: 2017-01-11T07:21:33Z
Filter data less than watermark in `Append` mode

commit c9f62c161ad94af35ad237673c85428ce6094ac5
Author: Liwei Lin
Date: 2017-01-11T07:38:02Z
Fix an error message
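For readers following along, the watermark-plus-`Append` setup this PR targets can be sketched as below. This is a minimal illustration, not code from the patch; the rate source, the `eventTime` column name, and the 10-minute/5-minute thresholds are made up for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WatermarkAppendSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]").appName("watermark-append-sketch").getOrCreate()
    import spark.implicits._

    // A rate source stands in for real event data; its `timestamp` column plays the event time.
    val events = spark.readStream
      .format("rate").option("rowsPerSecond", "10").load()
      .withColumnRenamed("timestamp", "eventTime")

    val counts = events
      .withWatermark("eventTime", "10 minutes") // rows older than the watermark count as late
      .groupBy(window($"eventTime", "5 minutes"))
      .count()

    // In Append mode a window is emitted only after the watermark closes it; the PR's
    // improvement is to also drop late input rows before they touch the aggregation state.
    val query = counts.writeStream.outputMode("append").format("console").start()
    query.awaitTermination(5000) // run briefly for illustration
    spark.stop()
  }
}
```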
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95525588

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala ---
@@ -543,4 +545,157 @@ class SQLViewSuite extends QueryTest with SQLTestUtils with TestHiveSingleton { } } }
+
+  test("correctly resolve a nested view") {
+    withTempDatabase { db =>
+      withView(s"$db.view1", s"$db.view2") {
+        val view1 = CatalogTable(
+          identifier = TableIdentifier("view1", Some(db)),
+          tableType = CatalogTableType.VIEW,
+          storage = CatalogStorageFormat.empty,
+          schema = new StructType().add("id", "int").add("id1", "int"),
+          viewOriginalText = Some("SELECT * FROM jt"),
+          viewText = Some("SELECT * FROM jt"),
+          properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+        val view2 = CatalogTable(
+          identifier = TableIdentifier("view2", Some(db)),
+          tableType = CatalogTableType.VIEW,
+          storage = CatalogStorageFormat.empty,
+          schema = new StructType().add("id", "int").add("id1", "int"),
+          viewOriginalText = Some("SELECT * FROM view1"),
+          viewText = Some("SELECT * FROM view1"),
+          properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> db})
+        activateDatabase(db) {
+          hiveContext.sessionState.catalog.createTable(view1, ignoreIfExists = false)
+          hiveContext.sessionState.catalog.createTable(view2, ignoreIfExists = false)
+          checkAnswer(sql("SELECT * FROM view2 ORDER BY id"), (1 to 9).map(i => Row(i, i)))
+        }
+      }
+    }
+  }
+
+  test("correctly resolve a view with CTE") {
+    withView("cte_view") {
+      val cte_view = CatalogTable(
+        identifier = TableIdentifier("cte_view"),
+        tableType = CatalogTableType.VIEW,
+        storage = CatalogStorageFormat.empty,
+        schema = new StructType().add("n", "int"),
+        viewOriginalText = Some("WITH w AS (SELECT 1 AS n) SELECT n FROM w"),
+        viewText = Some("WITH w AS (SELECT 1 AS n) SELECT n FROM w"),
+        properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+
+      hiveContext.sessionState.catalog.createTable(cte_view, ignoreIfExists = false)
+      checkAnswer(sql("SELECT * FROM cte_view"), Row(1))
+    }
+  }
+
+  test("correctly resolve a view in a self join") {
--- End diff --

Without `View` extending `MultiInstanceRelation`, it still works.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95525456

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
+case class View(
+    desc: CatalogTable,
+    output: Seq[Attribute],
+    child: LogicalPlan) extends LogicalPlan with MultiInstanceRelation {
--- End diff --

I still cannot get the point why we need to extend `MultiInstanceRelation` here. We only do it for the leaf node, right?
[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16233 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71181/ Test PASSed.
[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16441 Merged build finished. Test PASSed.
[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16233 Merged build finished. Test PASSed.
[GitHub] spark issue #16546: [WIP][SQL] Put check in ExpressionEncoder.fromRow to ens...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16546 **[Test build #71192 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71192/testReport)** for PR 16546 at commit [`190fb62`](https://github.com/apache/spark/commit/190fb6222d84991000f91735952579b1e0686a61).
[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16441 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71187/ Test PASSed.
[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16233 **[Test build #71181 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71181/testReport)** for PR 16233 at commit [`ff4b35f`](https://github.com/apache/spark/commit/ff4b35fe40b5e12f1a39c5129ec9a702d593457a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16441 **[Test build #71187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71187/testReport)** for PR 16441 at commit [`a6ef62e`](https://github.com/apache/spark/commit/a6ef62e65f76042d4ddfc0726125df12548efbaf). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16395: [SPARK-17075][SQL] implemented filter estimation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16395 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71180/ Test PASSed.
[GitHub] spark issue #16395: [SPARK-17075][SQL] implemented filter estimation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16395 Merged build finished. Test PASSed.
[GitHub] spark pull request #16546: [WIP][SQL] Put check in ExpressionEncoder.fromRow...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/16546 [WIP][SQL] Put check in ExpressionEncoder.fromRow to ensure we can convert deserialized object to required type

## What changes were proposed in this pull request?

Two problems are addressed in this patch.

1. Serializing a subclass of `Seq[_]` which doesn't have an element type

Currently, in `ScalaReflection.serializerFor`, we try to serialize all subtypes of `Seq[_]`. But for `Range`, which is a `Seq[Int]` and doesn't have an element type, `serializerFor` fails with a mystifying message:

    scala.MatchError: scala.collection.immutable.Range.Inclusive (of class scala.reflect.internal.Types$ClassNoArgsTypeRef)
      at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:520)
      at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:463)
      at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)

This patch tries to fix this by considering the types without an element type.

2. The encoder can't deserialize an internal row to the required type

We serialize objects with a common super class such as `Seq[_]` to a common internal representation. But when we want to deserialize the internal data back to the original objects, we encounter the problem of initializing different types of objects. For example, we deserialize the data serialized from `Seq[_]` to `WrappedArray`. That works when we serialize data of `Seq[_]`. If we serialize data of a subclass of `Seq[_]` (for example `Range`) which is not assignable from `WrappedArray`, there will be a runtime error when converting the deserialized data to the required subclass of `Seq[_]`. Short of explicitly writing down a deserialization rule for each subclass of `Seq[_]`, the feasible solution is to check whether we can convert the deserialized data to the required type. This patch puts that check into `ExpressionEncoder.fromRow`. Once the requirement is not met, we show a reasonable message to users.

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 encoder-range

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16546.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16546

commit 190fb6222d84991000f91735952579b1e0686a61
Author: Liang-Chi Hsieh
Date: 2017-01-11T03:38:44Z
Put check in ExpressionEncoder.fromRow to ensure we can convert deserialized object to required type.
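To make the first problem concrete, here is a minimal sketch assuming a local Spark session; the commented-out line is the call that, per the PR description, reproduces the `MatchError` on current master, and the `List` workaround is an illustrative alternative, not code from the patch:

```scala
import org.apache.spark.sql.SparkSession

object RangeEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[1]").appName("range-encoder-sketch").getOrCreate()
    import spark.implicits._

    // `Range` is a Seq[Int] whose reflected type carries no element-type argument,
    // so ScalaReflection.serializerFor has no case for it:
    // spark.createDataset(1 to 10) // scala.MatchError: scala.collection.immutable.Range.Inclusive

    // Materializing to a List first sidesteps the problem; the patch aims to make the
    // original call either work or fail with a clear message instead.
    val ds = spark.createDataset((1 to 10).toList)
    ds.show()

    spark.stop()
  }
}
```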
[GitHub] spark issue #16395: [SPARK-17075][SQL] implemented filter estimation
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16395 **[Test build #71180 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71180/testReport)** for PR 16395 at commit [`210b11b`](https://github.com/apache/spark/commit/210b11b4aef5673ec0f98ee12d60bc05fd32e44d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95524651

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -0,0 +1,555 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import scala.collection.immutable.{HashSet, Map}
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class FilterEstimation extends Logging {
+
+  /**
+   * We use a mutable colStats because we need to update the corresponding ColumnStat
+   * for a column after we apply a predicate condition. For example, a column c has
+   * [min, max] value as [0, 100]. In a range condition such as (c > 40 AND c <= 50),
+   * we need to set the column's [min, max] value to [40, 100] after we evaluate the
+   * first condition c > 40. We need to set the column's [min, max] value to [40, 50]
+   * after we evaluate the second condition c <= 50.
+   */
+  private var mutableColStats: mutable.Map[ExprId, ColumnStat] = mutable.Map.empty
+
+  /**
+   * Returns an option of Statistics for a Filter logical plan node.
+   * For a given compound expression condition, this method computes filter selectivity
+   * (or the percentage of rows meeting the filter condition), which is used to compute
+   * row count, size in bytes, and the updated statistics after a given predicate is applied.
+   *
+   * @param plan a LogicalPlan node that must be an instance of Filter.
+   * @return Option[Statistics] When no statistics are collected, it returns None.
+   */
+  def estimate(plan: Filter): Option[Statistics] = {
+    val stats: Statistics = plan.child.statistics
+    if (stats.rowCount.isEmpty) return None
+
+    // save a mutable copy of colStats so that we can later change it recursively
+    val statsExprIdMap: Map[ExprId, ColumnStat] =
+      stats.attributeStats.map(kv => (kv._1.exprId, kv._2))
+    mutableColStats = mutable.Map.empty ++= statsExprIdMap
+
+    // estimate selectivity of this filter predicate
+    val percent: Double = calculateConditions(plan, plan.condition)
+
+    // attributeStats has mapping Attribute-to-ColumnStat.
+    // mutableColStats has mapping ExprId-to-ColumnStat.
+    // We use an ExprId-to-Attribute map to facilitate the mapping Attribute-to-ColumnStat
+    val expridToAttrMap: Map[ExprId, Attribute] =
+      stats.attributeStats.map(kv => (kv._1.exprId, kv._1))
+    // copy mutableColStats contents to an immutable AttributeMap.
+    val mutableAttributeStats: mutable.Map[Attribute, ColumnStat] =
+      mutableColStats.map(kv => expridToAttrMap(kv._1) -> kv._2)
+    val newColStats = AttributeMap(mutableAttributeStats.toSeq)
+
+    val filteredRowCountValue: BigInt =
+      EstimationUtils.ceil(BigDecimal(stats.rowCount.get) * percent)
+    val avgRowSize = BigDecimal(EstimationUtils.getRowSize(plan.output, newColStats))
+    val filteredSizeInBytes: BigInt =
+      EstimationUtils.ceil(BigDecimal(filteredRowCountValue) * avgRowSize)
+
+    Some(stats.copy(sizeInBytes = filteredSizeInBytes, rowCount = Some(filteredRowCountValue),
+      attributeStats = newColStats))
+  }
+
+  /**
+   * Returns a percentage of rows meeting a compound condition in Filter node.
+   * A compound condition is decomposed into multiple single conditions linked with AND, OR, NOT.
+   * For logical AND conditions, we need to update stats after a condition estimation
+   * so that the stats will be more accurate for
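The arithmetic at the end of `estimate` above can be followed in isolation. The numbers below are hypothetical, and the local `ceil` helper only mimics what `EstimationUtils.ceil` is used for in the quoted code:

```scala
object SelectivitySketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical inputs standing in for plan.child.statistics.
    val rowCount = BigInt(1000)
    val selectivity = BigDecimal("0.10") // the percent returned by calculateConditions
    val avgRowSize = BigDecimal(24)      // bytes, as computed by EstimationUtils.getRowSize

    // Round a BigDecimal up to the next whole BigInt.
    def ceil(d: BigDecimal): BigInt = d.setScale(0, BigDecimal.RoundingMode.CEILING).toBigInt

    val filteredRowCount = ceil(BigDecimal(rowCount) * selectivity)           // 1000 * 0.10 = 100
    val filteredSizeInBytes = ceil(BigDecimal(filteredRowCount) * avgRowSize) // 100 * 24 = 2400

    println(s"rows: $filteredRowCount, bytes: $filteredSizeInBytes")
  }
}
```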
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95524585

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16395#discussion_r95524418

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -0,0 +1,555 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import scala.collection.immutable.{HashSet, Map}
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+
+class FilterEstimation extends Logging {
+
+  /**
+   * We use a mutable colStats because we need to update the corresponding ColumnStat
+   * for a column after we apply a predicate condition. For example, a column c has
+   * [min, max] value as [0, 100]. In a range condition such as (c > 40 AND c <= 50),
+   * we need to set the column's [min, max] value to [40, 100] after we evaluate the
+   * first condition c > 40. We need to set the column's [min, max] value to [40, 50]
+   * after we evaluate the second condition c <= 50.
+   */
+  private var mutableColStats: mutable.Map[ExprId, ColumnStat] = mutable.Map.empty
+
+  /**
+   * Returns an option of Statistics for a Filter logical plan node.
+   * For a given compound expression condition, this method computes filter selectivity
+   * (or the percentage of rows meeting the filter condition), which
+   * is used to compute row count, size in bytes, and the updated statistics after a given
+   * predicated is applied.
+   *
+   * @param plan a LogicalPlan node that must be an instance of Filter.
+   * @return Option[Statistics] When there is no statistics collected, it returns None.
+   */
+  def estimate(plan: Filter): Option[Statistics] = {
+    val stats: Statistics = plan.child.statistics
+    if (stats.rowCount.isEmpty) return None
+
+    // save a mutable copy of colStats so that we can later change it recursively
+    val statsExprIdMap: Map[ExprId, ColumnStat] =
+      stats.attributeStats.map(kv => (kv._1.exprId, kv._2))
+    mutableColStats = mutable.Map.empty ++= statsExprIdMap
+
+    // estimate selectivity of this filter predicate
+    val percent: Double = calculateConditions(plan, plan.condition)
+
+    // attributeStats has mapping Attribute-to-ColumnStat.
+    // mutableColStats has mapping ExprId-to-ColumnStat.
+    // We use an ExprId-to-Attribute map to facilitate the mapping Attribute-to-ColumnStat
+    val expridToAttrMap: Map[ExprId, Attribute] =
+      stats.attributeStats.map(kv => (kv._1.exprId, kv._1))
+    // copy mutableColStats contents to an immutable AttributeMap.
+    val mutableAttributeStats: mutable.Map[Attribute, ColumnStat] =
+      mutableColStats.map(kv => expridToAttrMap(kv._1) -> kv._2)
+    val newColStats = AttributeMap(mutableAttributeStats.toSeq)
+
+    val filteredRowCountValue: BigInt =
+      EstimationUtils.ceil(BigDecimal(stats.rowCount.get) * percent)
+    val avgRowSize = BigDecimal(EstimationUtils.getRowSize(plan.output, newColStats))
+    val filteredSizeInBytes: BigInt =
+      EstimationUtils.ceil(BigDecimal(filteredRowCountValue) * avgRowSize)
+
+    Some(stats.copy(sizeInBytes = filteredSizeInBytes, rowCount = Some(filteredRowCountValue),
+      attributeStats = newColStats))
+  }
+
+  /**
+   * Returns a percentage of rows meeting a compound condition in Filter node.
+   * A compound condition is depomposed into multiple single conditions linked with AND, OR, NOT.
+   * For logical AND conditions, we need to update stats after a condition estimation
+   * so that the stats will be more accurate for
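The quoted doc comment walks through narrowing a column's [min, max] under a range predicate and then scaling row count and size by the resulting selectivity. The flow can be sketched standalone as follows; this is a simplified, hypothetical model (the `ColRange` type and helper names are invented for illustration, and a uniform value distribution is assumed), not Spark's actual `ColumnStat`/`EstimationUtils` API:

```scala
// Simplified sketch of the estimation flow described in the quoted doc
// comment: narrow a column's [min, max] for each range condition, derive a
// selectivity as the surviving fraction of the range (uniform-distribution
// assumption), then scale row count and size.  All names are hypothetical.
object FilterSelectivitySketch {
  final case class ColRange(min: Double, max: Double) {
    def width: Double = max - min
  }

  // "col > v": raise the lower bound; selectivity is the fraction of the
  // current range that survives.
  def greaterThan(r: ColRange, v: Double): (ColRange, Double) = {
    val narrowed = ColRange(math.max(r.min, v), r.max)
    (narrowed, narrowed.width / r.width)
  }

  // "col <= v": lower the upper bound of the (possibly already narrowed) range.
  def lessOrEqual(r: ColRange, v: Double): (ColRange, Double) = {
    val narrowed = ColRange(r.min, math.min(r.max, v))
    (narrowed, narrowed.width / r.width)
  }

  def main(args: Array[String]): Unit = {
    // Column c with [min, max] = [0, 100] and predicate (c > 40 AND c <= 50).
    val (r1, p1) = greaterThan(ColRange(0, 100), 40) // [40, 100], p1 = 0.6
    val (r2, p2) = lessOrEqual(r1, 50)               // [40, 50],  p2 = 1/6
    val percent = p1 * p2                            // AND multiplies: ~0.1

    // Scale the child's row count by the selectivity, rounding up the way
    // estimate() does via EstimationUtils.ceil; size follows from avg row size.
    val childRowCount = BigInt(1000)
    val avgRowSize = BigInt(8)
    val filteredRowCount = (BigDecimal(childRowCount) * percent)
      .setScale(0, BigDecimal.RoundingMode.CEILING).toBigInt
    val filteredSizeInBytes = filteredRowCount * avgRowSize

    println(s"range=[${r2.min}, ${r2.max}] rows=$filteredRowCount bytes=$filteredSizeInBytes")
  }
}
```

Multiplying the two per-condition selectivities only after each one has narrowed the shared range is exactly why the PR keeps a mutable copy of the column stats: the second condition's fraction is computed against the already-narrowed [40, 100], not the original [0, 100].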
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16395#discussion_r95524337

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16395#discussion_r95524302

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16395#discussion_r95524281

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16395#discussion_r95524128

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16395#discussion_r95524086

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16395#discussion_r95523860

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95523914 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -0,0 +1,555 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans.logical.statsEstimation + +import scala.collection.immutable.{HashSet, Map} +import scala.collection.mutable + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + + +class FilterEstimation extends Logging { + + /** + * We use a mutable colStats because we need to update the corresponding ColumnStat + * for a column after we apply a predicate condition. For example, A column c has + * [min, max] value as [0, 100]. In a range condition such as (c > 40 AND c <= 50), + * we need to set the column's [min, max] value to [40, 100] after we evaluate the + * first condition c > 40. 
We need to set the column's [min, max] value to [40, 50] + * after we evaluate the second condition c <= 50. + */ + private var mutableColStats: mutable.Map[ExprId, ColumnStat] = mutable.Map.empty + + /** + * Returns an option of Statistics for a Filter logical plan node. + * For a given compound expression condition, this method computes filter selectivity + * (or the percentage of rows meeting the filter condition), which + * is used to compute row count, size in bytes, and the updated statistics after a given + * predicated is applied. + * + * @param plan a LogicalPlan node that must be an instance of Filter. + * @return Option[Statistics] When there is no statistics collected, it returns None. + */ + def estimate(plan: Filter): Option[Statistics] = { +val stats: Statistics = plan.child.statistics +if (stats.rowCount.isEmpty) return None + +// save a mutable copy of colStats so that we can later change it recursively +val statsExprIdMap: Map[ExprId, ColumnStat] = + stats.attributeStats.map(kv => (kv._1.exprId, kv._2)) +mutableColStats = mutable.Map.empty ++= statsExprIdMap + +// estimate selectivity of this filter predicate +val percent: Double = calculateConditions(plan, plan.condition) + +// attributeStats has mapping Attribute-to-ColumnStat. +// mutableColStats has mapping ExprId-to-ColumnStat. +// We use an ExprId-to-Attribute map to facilitate the mapping Attribute-to-ColumnStat +val expridToAttrMap: Map[ExprId, Attribute] = + stats.attributeStats.map(kv => (kv._1.exprId, kv._1)) +// copy mutableColStats contents to an immutable AttributeMap. 
+val mutableAttributeStats: mutable.Map[Attribute, ColumnStat] = + mutableColStats.map(kv => expridToAttrMap(kv._1) -> kv._2) +val newColStats = AttributeMap(mutableAttributeStats.toSeq) + +val filteredRowCountValue: BigInt = + EstimationUtils.ceil(BigDecimal(stats.rowCount.get) * percent) +val avgRowSize = BigDecimal(EstimationUtils.getRowSize(plan.output, newColStats)) +val filteredSizeInBytes: BigInt = + EstimationUtils.ceil(BigDecimal(filteredRowCountValue) * avgRowSize) + +Some(stats.copy(sizeInBytes = filteredSizeInBytes, rowCount = Some(filteredRowCountValue), + attributeStats = newColStats)) + } + + /** + * Returns a percentage of rows meeting a compound condition in Filter node. + * A compound condition is depomposed into multiple single conditions linked with AND, OR, NOT. --- End diff -- decomposed? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as
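The `estimate` method quoted above reduces to simple arithmetic once the selectivity `percent` is known: the filtered row count is the ceiling of rowCount × percent, and the filtered size in bytes is the ceiling of that row count times the average row size. A minimal standalone sketch of just that arithmetic (the helper name and tuple return are illustrative, not the PR's API):

```scala
import scala.math.BigDecimal.RoundingMode

// Scale the child's row count by the filter's selectivity, then derive the
// new size in bytes from the scaled row count and the average row size.
// Both results are rounded up, matching EstimationUtils.ceil in the PR.
def filteredStats(rowCount: BigInt, avgRowSize: BigDecimal, percent: Double): (BigInt, BigInt) = {
  val filteredRowCount =
    (BigDecimal(rowCount) * percent).setScale(0, RoundingMode.CEILING).toBigInt
  val filteredSizeInBytes =
    (BigDecimal(filteredRowCount) * avgRowSize).setScale(0, RoundingMode.CEILING).toBigInt
  (filteredRowCount, filteredSizeInBytes)
}

// e.g. 100 rows, 8 bytes per row on average, 25% selectivity -> (25, 200)
val (rows, bytes) = filteredStats(BigInt(100), BigDecimal(8), 0.25)
```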
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95523806 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -0,0 +1,173 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.statsEstimation + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._ +import org.apache.spark.sql.types.IntegerType + +/** + * In this test suite, we test the proedicates containing the following operators: --- End diff -- "the proedicates" -> "predicates"
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95523768 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/Range.scala --- @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans.logical.statsEstimation + +import java.math.{BigDecimal => JDecimal} +import java.sql.{Date, Timestamp} + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types.{BooleanType, DateType, TimestampType, _} + + +/** Value range of a column. */ +trait Range + +/** For simplicity we use decimal to unify operations of numeric ranges. */ +case class NumericRange(min: JDecimal, max: JDecimal) extends Range + +/** + * This version of Spark does not have min/max for binary/string types, we define their default + * behaviors by this class. + */ +class DefaultRange extends Range + +/** This is for columns with only null values. 
*/ +class NullRange extends Range + +object Range { + def apply(min: Option[Any], max: Option[Any], dataType: DataType): Range = dataType match { +case StringType | BinaryType => new DefaultRange() +case _ if min.isEmpty || max.isEmpty => new NullRange() +case _ => toNumericRange(min.get, max.get, dataType) + } + + /** + * For simplicity we use decimal to unify operations of numeric types, the two methods below + * are the contract of conversion. + */ + private def toNumericRange(min: Any, max: Any, dataType: DataType): NumericRange = { +dataType match { + case _: NumericType => +NumericRange(new JDecimal(min.toString), new JDecimal(max.toString)) + case BooleanType => +val min1 = if (min.asInstanceOf[Boolean]) 1 else 0 +val max1 = if (max.asInstanceOf[Boolean]) 1 else 0 +NumericRange(new JDecimal(min1), new JDecimal(max1)) + case DateType => +val min1 = DateTimeUtils.fromJavaDate(min.asInstanceOf[Date]) +val max1 = DateTimeUtils.fromJavaDate(max.asInstanceOf[Date]) +NumericRange(new JDecimal(min1), new JDecimal(max1)) + case TimestampType => +val min1 = DateTimeUtils.fromJavaTimestamp(min.asInstanceOf[Timestamp]) +val max1 = DateTimeUtils.fromJavaTimestamp(max.asInstanceOf[Timestamp]) +NumericRange(new JDecimal(min1), new JDecimal(max1)) + case _ => +throw new AnalysisException(s"Type $dataType is not castable to numeric in estimation.") --- End diff -- when we get here, is it an error in spark? if yes, we should probably throw UnsupportedOperationException
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95523666 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/Range.scala --- @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans.logical.statsEstimation + +import java.math.{BigDecimal => JDecimal} +import java.sql.{Date, Timestamp} + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types.{BooleanType, DateType, TimestampType, _} + + +/** Value range of a column. */ +trait Range + +/** For simplicity we use decimal to unify operations of numeric ranges. */ +case class NumericRange(min: JDecimal, max: JDecimal) extends Range + +/** + * This version of Spark does not have min/max for binary/string types, we define their default + * behaviors by this class. + */ +class DefaultRange extends Range + +/** This is for columns with only null values. 
*/ +class NullRange extends Range + +object Range { + def apply(min: Option[Any], max: Option[Any], dataType: DataType): Range = dataType match { +case StringType | BinaryType => new DefaultRange() +case _ if min.isEmpty || max.isEmpty => new NullRange() +case _ => toNumericRange(min.get, max.get, dataType) + } + + /** + * For simplicity we use decimal to unify operations of numeric types, the two methods below + * are the contract of conversion. + */ + private def toNumericRange(min: Any, max: Any, dataType: DataType): NumericRange = { +dataType match { + case _: NumericType => +NumericRange(new JDecimal(min.toString), new JDecimal(max.toString)) + case BooleanType => +val min1 = if (min.asInstanceOf[Boolean]) 1 else 0 +val max1 = if (max.asInstanceOf[Boolean]) 1 else 0 +NumericRange(new JDecimal(min1), new JDecimal(max1)) + case DateType => +val min1 = DateTimeUtils.fromJavaDate(min.asInstanceOf[Date]) +val max1 = DateTimeUtils.fromJavaDate(max.asInstanceOf[Date]) +NumericRange(new JDecimal(min1), new JDecimal(max1)) + case TimestampType => +val min1 = DateTimeUtils.fromJavaTimestamp(min.asInstanceOf[Timestamp]) --- End diff -- can we make sure we have tests for date / timestamp types?
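The conversion rxin is reviewing maps every supported type's min/max onto `java.math.BigDecimal` so that range arithmetic only ever deals with one numeric representation. A hedged sketch of that idea in isolation — `DateTimeUtils` is Spark-internal, so epoch-based stand-ins are used here, and the helper name is hypothetical:

```scala
import java.math.{BigDecimal => JDecimal}
import java.sql.{Date, Timestamp}

// Unify min/max values of different types as java.math.BigDecimal.
// The Date/Timestamp arms are stand-ins for DateTimeUtils.fromJavaDate /
// fromJavaTimestamp (whole days and microseconds since the epoch, respectively).
def toDecimal(value: Any): JDecimal = value match {
  case b: Boolean   => new JDecimal(if (b) 1 else 0)
  case d: Date      => new JDecimal(d.getTime / (24L * 60 * 60 * 1000)) // days since epoch
  case t: Timestamp => new JDecimal(t.getTime * 1000L)                  // microseconds since epoch
  case other        => new JDecimal(other.toString)                     // numeric types
}
```

With all four cases collapsed to one type, a `NumericRange(min, max)` can be compared against any literal without per-type logic.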
[GitHub] spark issue #16544: [SPARK-19149][SQL] Follow-up: simplify cache implementat...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/16544 LGTM
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95522835 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -0,0 +1,173 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.statsEstimation + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._ +import org.apache.spark.sql.types.IntegerType + +/** + * In this test suite, we test the proedicates containing the following operators: + * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN + */ + +class FilterEstimationSuite extends StatsEstimationTestBase { + + // Suppose our test table has one column called "key1". 
+ // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 + // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, maxLen:4 + val ar = AttributeReference("key1", IntegerType)() + val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4) + val child = StatsTestPlan( +outputList = Seq(ar), +stats = Statistics( + sizeInBytes = 10 * 4, + rowCount = Some(10), + attributeStats = AttributeMap(Seq(ar -> childColStat)) +) + ) + + test("filter estimation with equality comparison") { +// the predicate is "WHERE key1 = 2" +val intValue = Literal(2, IntegerType) +val condition = EqualTo(ar, intValue) +val filterNode = Filter(condition, child) +val filteredColStats = ColumnStat(1, Some(2), Some(2), 0, 4, 4) + +validateEstimatedStats(filterNode, filteredColStats, Some(1L)) + } + + test("filter estimation with less than comparison") { +// the predicate is "WHERE key1 < 3" +val intValue = Literal(3, IntegerType) --- End diff -- same thing with the following test cases - can you make sure we have the proper coverage? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95522818 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -0,0 +1,173 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.statsEstimation + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._ +import org.apache.spark.sql.types.IntegerType + +/** + * In this test suite, we test the proedicates containing the following operators: + * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN + */ + +class FilterEstimationSuite extends StatsEstimationTestBase { + + // Suppose our test table has one column called "key1". 
+ // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 + // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, maxLen:4 + val ar = AttributeReference("key1", IntegerType)() + val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4) + val child = StatsTestPlan( +outputList = Seq(ar), +stats = Statistics( + sizeInBytes = 10 * 4, + rowCount = Some(10), + attributeStats = AttributeMap(Seq(ar -> childColStat)) +) + ) + + test("filter estimation with equality comparison") { +// the predicate is "WHERE key1 = 2" +val intValue = Literal(2, IntegerType) +val condition = EqualTo(ar, intValue) +val filterNode = Filter(condition, child) +val filteredColStats = ColumnStat(1, Some(2), Some(2), 0, 4, 4) + +validateEstimatedStats(filterNode, filteredColStats, Some(1L)) + } + + test("filter estimation with less than comparison") { +// the predicate is "WHERE key1 < 3" +val intValue = Literal(3, IntegerType) --- End diff -- here we should also have a check using value > 10 shouldn't we? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95522719 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -0,0 +1,173 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.statsEstimation + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._ +import org.apache.spark.sql.types.IntegerType + +/** + * In this test suite, we test the proedicates containing the following operators: + * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN + */ + +class FilterEstimationSuite extends StatsEstimationTestBase { + + // Suppose our test table has one column called "key1". 
+ // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 + // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, maxLen:4 + val ar = AttributeReference("key1", IntegerType)() + val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4) + val child = StatsTestPlan( +outputList = Seq(ar), +stats = Statistics( + sizeInBytes = 10 * 4, + rowCount = Some(10), + attributeStats = AttributeMap(Seq(ar -> childColStat)) +) + ) + + test("filter estimation with equality comparison") { --- End diff -- also i'd try to make this more readable by just using this as the test case name: ``` test("key1 = 2") { } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95522619 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -0,0 +1,173 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.statsEstimation + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._ +import org.apache.spark.sql.types.IntegerType + +/** + * In this test suite, we test the proedicates containing the following operators: + * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN + */ + +class FilterEstimationSuite extends StatsEstimationTestBase { + + // Suppose our test table has one column called "key1". 
+ // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 + // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, maxLen:4 + val ar = AttributeReference("key1", IntegerType)() + val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4) + val child = StatsTestPlan( +outputList = Seq(ar), +stats = Statistics( + sizeInBytes = 10 * 4, + rowCount = Some(10), + attributeStats = AttributeMap(Seq(ar -> childColStat)) +) + ) + + test("filter estimation with equality comparison") { +// the predicate is "WHERE key1 = 2" +val intValue = Literal(2, IntegerType) +val condition = EqualTo(ar, intValue) --- End diff -- ok i now read through the test case. to make it more readable, I'd write it like this: ``` validateEstimatedStats( Filter(EqualTo(ar, Literal(2)), child), ColumnStat(distinctCount = 1, min = Some(2), max = Some(2), nullCount = 0, avgLen = 4, maxLen = 4) ) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
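The expected row counts in these test cases follow from the usual selectivity formulas — assuming this reading of the PR is right: equality selects 1/distinctCount of the rows, and a less-than predicate selects the fraction of the [min, max] interval below the literal, with the final row count rounded up. A sketch using the suite's table (10 rows, key1 in [1, 10], distinctCount = 10); the helper names are illustrative, not the PR's API:

```scala
// Selectivity of `col = v` for v inside [min, max]: one distinct value out of ndv.
def equalitySelectivity(distinctCount: Long): Double = 1.0 / distinctCount

// Selectivity of `col < v`: the fraction of the [min, max] range below v.
def lessThanSelectivity(v: Double, min: Double, max: Double): Double =
  (v - min) / (max - min)

// Round the scaled row count up, mirroring EstimationUtils.ceil.
def estimatedRows(rowCount: Long, selectivity: Double): Long =
  math.ceil(rowCount * selectivity).toLong

estimatedRows(10, equalitySelectivity(10))              // key1 = 2 -> 1 row
estimatedRows(10, lessThanSelectivity(3, 1, 10))        // key1 < 3 -> 3 rows
estimatedRows(10, 1.0 - lessThanSelectivity(6, 1, 10))  // key1 > 6 -> 5 rows
```

If this reading is correct, the three results line up with the suite's expected counts of `Some(1L)`, `Some(3L)`, and `Some(5L)`.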
[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15435 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71179/ Test PASSed.
[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15435 Merged build finished. Test PASSed.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95522422 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2476,4 +2476,14 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { assert(sql("SELECT * FROM array_tbl where arr = ARRAY(1L)").count == 1) } } + + test("should be able to resolve a persistent view") { --- End diff -- I see. That means, this PR enables the view support without enabling Hive support. This test case is just covering a very basic case. We need to check more scenarios, like ALTER VIEW. Please remember this in the follow-up PRs. Also, update the PR description and mention this in a separate bullet.
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95522395 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -0,0 +1,173 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.statsEstimation + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._ +import org.apache.spark.sql.types.IntegerType + +/** + * In this test suite, we test the proedicates containing the following operators: + * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN + */ + +class FilterEstimationSuite extends StatsEstimationTestBase { + + // Suppose our test table has one column called "key1". 
+ // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 + // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, maxLen:4 + val ar = AttributeReference("key1", IntegerType)() + val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4) + val child = StatsTestPlan( +outputList = Seq(ar), +stats = Statistics( + sizeInBytes = 10 * 4, + rowCount = Some(10), + attributeStats = AttributeMap(Seq(ar -> childColStat)) +) + ) + + test("filter estimation with equality comparison") { +// the predicate is "WHERE key1 = 2" +val intValue = Literal(2, IntegerType) +val condition = EqualTo(ar, intValue) +val filterNode = Filter(condition, child) +val filteredColStats = ColumnStat(1, Some(2), Some(2), 0, 4, 4) --- End diff -- also i'd rename `filteredColStats` to just `expected` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95522373
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * In this test suite, we test the predicates containing the following operators:
+ * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN
+ */
+class FilterEstimationSuite extends StatsEstimationTestBase {
+
+  // Suppose our test table has one column called "key1".
+  // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
+  // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, maxLen:4
+  val ar = AttributeReference("key1", IntegerType)()
+  val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4)
+  val child = StatsTestPlan(
+    outputList = Seq(ar),
+    stats = Statistics(
+      sizeInBytes = 10 * 4,
+      rowCount = Some(10),
+      attributeStats = AttributeMap(Seq(ar -> childColStat))
+    )
+  )
+
+  test("filter estimation with equality comparison") {
+    // the predicate is "WHERE key1 = 2"
+    val intValue = Literal(2, IntegerType)
+    val condition = EqualTo(ar, intValue)
+    val filterNode = Filter(condition, child)
+    val filteredColStats = ColumnStat(1, Some(2), Some(2), 0, 4, 4)
+
+    validateEstimatedStats(filterNode, filteredColStats, Some(1L))
+  }
+
+  test("filter estimation with less than comparison") {
+    // the predicate is "WHERE key1 < 3"
+    val intValue = Literal(3, IntegerType)
+    val condition = LessThan(ar, intValue)
+    val filterNode = Filter(condition, child)
+    val filteredColStats = ColumnStat(2, Some(1), Some(3), 0, 4, 4)
+
+    validateEstimatedStats(filterNode, filteredColStats, Some(3L))
+  }
+
+  test("filter estimation with less than or equal to comparison") {
+    // the predicate is "WHERE key1 <= 3"
+    val intValue = Literal(3, IntegerType)
+    val condition = LessThanOrEqual(ar, intValue)
+    val filterNode = Filter(condition, child)
+    val filteredColStats = ColumnStat(2, Some(1), Some(3), 0, 4, 4)
+
+    validateEstimatedStats(filterNode, filteredColStats, Some(3L))
+  }
+
+  test("filter estimation with greater than comparison") {
+    // the predicate is "WHERE key1 > 6"
+    val intValue = Literal(6, IntegerType)
+    val condition = GreaterThan(ar, intValue)
+    val filterNode = Filter(condition, child)
+    val filteredColStats = ColumnStat(4, Some(6), Some(10), 0, 4, 4)
+
+    validateEstimatedStats(filterNode, filteredColStats, Some(5L))
+  }
+
+  test("filter estimation with greater than or equal to comparison") {
+    // the predicate is "WHERE key1 >= 6"
+    val intValue = Literal(6, IntegerType)
+    val condition = GreaterThanOrEqual(ar, intValue)
+    val filterNode = Filter(condition, child)
+    val filteredColStats = ColumnStat(4, Some(6), Some(10), 0, 4, 4)
+
+    validateEstimatedStats(filterNode, filteredColStats, Some(5L))
+  }
+
+  test("filter estimation with IS NULL comparison") {
+    // the predicate is "WHERE key1 IS NULL"
+    val condition = IsNull(ar)
+    val filterNode = Filter(condition, child)
+    val filteredColStats = ColumnStat(0, None, None, 0, 4, 4)
+
+    validateEstimatedStats(filterNode, filteredColStats, Some(0L))
+  }
+
+  test("filter estimation with IS NOT NULL comparison") {
+    // the predicate is "WHERE key1 IS NOT NULL"
+    val condition =
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95522411
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
+    val condition = EqualTo(ar, intValue)
+    val filterNode = Filter(condition, child)
--- End diff --
just `filter`
[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15435 **[Test build #71179 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71179/testReport)** for PR 15435 at commit [`3101f29`](https://github.com/apache/spark/commit/3101f2905f314cbfa1df6383708a4438adbcff00). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95522307
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
+    val intValue = Literal(2, IntegerType)
+    val condition = EqualTo(ar, intValue)
--- End diff --
this is just
```
val condition = EqualTo(ar, Literal(2))
```
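rxin's suggestion works because `Literal.apply` can infer the Catalyst data type from the Scala value, making the explicit `IntegerType` redundant. The snippet below is a toy model of that idea, not the real Catalyst API: `DataType`, `IntegerType`, and `Literal` here are simplified stand-ins defined locally.

```scala
// Toy model of Catalyst's Literal, illustrating why Literal(2) suffices:
// the one-argument apply infers the data type from the Scala value.
// All names below are local simplifications, not Spark's actual classes.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType

case class Literal(value: Any, dataType: DataType)

object Literal {
  // Overloaded apply that infers the type from the value, as the real
  // Literal.apply does for common Scala types.
  def apply(v: Any): Literal = v match {
    case i: Int    => Literal(i, IntegerType)
    case s: String => Literal(s, StringType)
    case other     => throw new IllegalArgumentException(s"unsupported literal: $other")
  }
}

// The short form and the explicit form produce the same literal.
assert(Literal(2) == Literal(2, IntegerType))
println(Literal(2))
```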
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95522247
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
+    val filterNode = Filter(condition, child)
+    val filteredColStats = ColumnStat(1, Some(2), Some(2), 0, 4, 4)
--- End diff --
as i commented on the other pr, i think we should use named arguments here so readers would know what 0, 4, 4 means.
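The readability gain rxin is asking for can be sketched with a simplified stand-in for `ColumnStat`. The field names below (distinctCount, min, max, nullCount, avgLen, maxLen) are taken from the comment in the test suite above; the class itself is a local sketch, not Spark's actual `ColumnStat`.

```scala
// Simplified stand-in for Catalyst's ColumnStat, with field names taken
// from the comment in the test suite. Illustrates positional vs named
// arguments at the call site.
case class ColumnStat(
    distinctCount: Long,
    min: Option[Int],
    max: Option[Int],
    nullCount: Long,
    avgLen: Long,
    maxLen: Long)

// Positional: a reader cannot tell what the trailing 0, 4, 4 mean.
val positional = ColumnStat(1, Some(2), Some(2), 0, 4, 4)

// Named: the same value, self-documenting at the call site.
val named = ColumnStat(
  distinctCount = 1, min = Some(2), max = Some(2),
  nullCount = 0, avgLen = 4, maxLen = 4)

assert(positional == named)
println(named)
```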
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95522065
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -0,0 +1,555 @@
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import scala.collection.immutable.{HashSet, Map}
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class FilterEstimation extends Logging {
+
+  /**
+   * We use a mutable colStats because we need to update the corresponding ColumnStat
+   * for a column after we apply a predicate condition. For example, a column c has
+   * [min, max] value as [0, 100]. In a range condition such as (c > 40 AND c <= 50),
+   * we need to set the column's [min, max] value to [40, 100] after we evaluate the
+   * first condition c > 40. We need to set the column's [min, max] value to [40, 50]
+   * after we evaluate the second condition c <= 50.
+   */
+  private var mutableColStats: mutable.Map[ExprId, ColumnStat] = mutable.Map.empty
+
+  /**
+   * Returns an option of Statistics for a Filter logical plan node.
+   * For a given compound expression condition, this method computes filter selectivity
+   * (or the percentage of rows meeting the filter condition), which
+   * is used to compute row count, size in bytes, and the updated statistics after a given
+   * predicate is applied.
+   *
+   * @param plan a LogicalPlan node that must be an instance of Filter.
--- End diff --
we can probably remove this since it doesn't really carry any information ... (plan's type is already Filter)
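The mutable-stats behavior described in the quoted comment can be sketched in a few lines: applying each conjunct of `(c > 40 AND c <= 50)` in turn narrows the column's [min, max], and the second conjunct sees the range already narrowed by the first. This is only an illustration of the idea, not Spark's actual `FilterEstimation` logic.

```scala
import scala.collection.mutable

// Minimal sketch of the mutable column-stats idea: each conjunct of a
// range condition narrows [min, max], and later conjuncts see the
// already-updated range. Range is a local stand-in for ColumnStat's
// min/max fields.
case class Range(min: Int, max: Int)

val stats = mutable.Map("c" -> Range(0, 100))

// c > 40: raise the lower bound, [0, 100] -> [40, 100].
stats("c") = stats("c").copy(min = math.max(stats("c").min, 40))
assert(stats("c") == Range(40, 100))

// c <= 50: lower the upper bound of the updated range, [40, 100] -> [40, 50].
stats("c") = stats("c").copy(max = math.min(stats("c").max, 50))
assert(stats("c") == Range(40, 50))

println(stats("c"))
```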
[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15435 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71177/ Test PASSed.
[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15435 Merged build finished. Test PASSed.
[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15435 **[Test build #71177 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71177/testReport)** for PR 15435 at commit [`3fb4bfc`](https://github.com/apache/spark/commit/3fb4bfcce42d07704c8da889f9ff82416adcbd08). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16528: [SPARK-19148][SQL] do not expose the external table conc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16528 **[Test build #71191 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71191/testReport)** for PR 16528 at commit [`0d1baf1`](https://github.com/apache/spark/commit/0d1baf1f3bc92e52ed6f29b0b06db5979ff0babf).
[GitHub] spark issue #16545: [SPARK-19166][SQL]rename from InsertIntoHadoopFsRelation...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16545 **[Test build #71190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71190/testReport)** for PR 16545 at commit [`0622c04`](https://github.com/apache/spark/commit/0622c04c6129ef699ca0d8f6907d8bbc6d025387).
[GitHub] spark pull request #16545: [SPARK-19166][SQL]rename from InsertIntoHadoopFsR...
GitHub user windpiger opened a pull request: https://github.com/apache/spark/pull/16545
[SPARK-19166][SQL] rename from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix

## What changes were proposed in this pull request?
InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions deletes all files that match a static prefix, such as a partition file path (/table/foo=1) or a non-partition file path (/xxx/a.json), while the method name deleteMatchingPartitions suggests that only partition files will be deleted. This name is confusing, so it is better to rename the method.

## How was this patch tested?

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/windpiger/spark modifyAMethodName
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16545.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16545
commit 0622c04c6129ef699ca0d8f6907d8bbc6d025387
Author: windpiger
Date: 2017-01-11T06:59:22Z
[SPARK-19166][SQL]change method name from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix
[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16395#discussion_r95521090
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -52,3 +56,12 @@ object EstimationUtils {
    }.sum
  }
}
+
+/** Attribute Reference extractor */
+object ExtractAttr {
--- End diff --
is this necessary? isn't this just
```
case op @ EqualTo(ar: AttributeReference, l: Literal) =>
```
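rxin's point is that a typed sub-pattern on the case class already binds the `AttributeReference`, so a dedicated extractor object adds nothing. The sketch below uses a hypothetical mini expression AST (the names mirror Catalyst but are defined locally) to show the inline pattern doing the extraction by itself.

```scala
// Hypothetical mini expression AST; names mirror Catalyst's but the
// classes are local simplifications.
sealed trait Expression
case class AttributeReference(name: String) extends Expression
case class Literal(value: Any) extends Expression
case class EqualTo(left: Expression, right: Expression) extends Expression

def describe(e: Expression): String = e match {
  // Typed sub-patterns bind both sides inline — no ExtractAttr-style
  // helper object is needed.
  case op @ EqualTo(ar: AttributeReference, l: Literal) =>
    s"${ar.name} = ${l.value}"
  case _ => "unsupported"
}

assert(describe(EqualTo(AttributeReference("key1"), Literal(2))) == "key1 = 2")
println(describe(EqualTo(AttributeReference("key1"), Literal(2))))
```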
[GitHub] spark issue #16544: [SPARK-19149][SQL] Follow-up: simplify cache implementat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16544 **[Test build #71189 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71189/testReport)** for PR 16544 at commit [`622d62b`](https://github.com/apache/spark/commit/622d62b8c1659f420f29cc36729f7cc5957a9027).
[GitHub] spark pull request #16544: [SPARK-19149][SQL] Follow-up: simplify cache impl...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/16544
[SPARK-19149][SQL] Follow-up: simplify cache implementation.

## What changes were proposed in this pull request?
This patch slightly simplifies the logical plan statistics cache implementation, as discussed in https://github.com/apache/spark/pull/16529

## How was this patch tested?
N/A - this has no behavior change.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rxin/spark SPARK-19149
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16544.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16544
commit 622d62b8c1659f420f29cc36729f7cc5957a9027
Author: Reynold Xin
Date: 2017-01-11T06:45:49Z
[SPARK-19149][SQL] Follow-up: simplify cache implementation.
[GitHub] spark issue #16544: [SPARK-19149][SQL] Follow-up: simplify cache implementat...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16544 cc @wzhfy
[GitHub] spark issue #16543: [SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm for Gamm...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16543 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71184/ Test PASSed.
[GitHub] spark issue #16543: [SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm for Gamm...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16543 **[Test build #71184 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71184/consoleFull)** for PR 16543 at commit [`b7f934a`](https://github.com/apache/spark/commit/b7f934ad2eb3f39125d9bc29289e8ce3a49f48b7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16543: [SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm for Gamm...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16543 Merged build finished. Test PASSed.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95520270
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala ---
@@ -0,0 +1,80 @@
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast}
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * This file defines analysis rules related to views.
+ */
+
+/**
+ * Make sure that a view's child plan produces the view's output attributes. We wrap the child
+ * with a Project and add an alias for each output attribute. The attributes are resolved by
+ * name. This should only be done after the Resolution batch, because the view attributes are
+ * not completely resolved during that batch.
+ */
+case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+    case v @ View(_, output, child) if child.resolved =>
+      val resolver = conf.resolver
+      val newOutput = output.map { attr =>
+        val originAttr = findAttributeByName(attr.name, child.output, resolver)
+        // The dataType of the output attributes may not be the same as that of the view
+        // output, so we should cast the attribute to the dataType of the view output
+        // attribute. If the cast cannot be performed, an AnalysisException will be thrown.
+        Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = attr.exprId,
--- End diff --
Note that all tests can pass without this `Cast`, but it does fix a weird behavior: the result of a view query may have a different schema if the view definition has been changed. Shall we pull it out into a follow-up PR or do it here? cc @hvanhovell @yhuai
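The rule being reviewed resolves each view output attribute by name in the child plan and casts it to the view's recorded type. The sketch below models that idea with invented stand-ins (`Attr`, string type names, SQL-ish output strings) rather than Catalyst's `Attribute`/`Alias`/`Cast`, to show why the `Cast` keeps the view's schema stable when the underlying table's types drift.

```scala
// Minimal sketch of the AliasViewChild idea: resolve each view output
// attribute by name in the child output, casting when the child's type
// has drifted from the view's recorded type. Attr and the string-typed
// "dataType" are invented stand-ins, not Catalyst classes.
case class Attr(name: String, dataType: String)

// Case-insensitive name resolver, matching Spark's default behavior.
val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

def aliasViewChild(viewOutput: Seq[Attr], childOutput: Seq[Attr]): Seq[String] =
  viewOutput.map { attr =>
    val origin = childOutput
      .find(c => resolver(c.name, attr.name))
      .getOrElse(throw new IllegalArgumentException(s"cannot resolve ${attr.name}"))
    // Cast only when the child's type differs from the view's recorded type.
    if (origin.dataType == attr.dataType) origin.name
    else s"CAST(${origin.name} AS ${attr.dataType})"
  }

// The view recorded id as int, but the underlying table now has ID: bigint;
// the cast restores the view's schema.
val projected = aliasViewChild(
  viewOutput = Seq(Attr("id", "int"), Attr("name", "string")),
  childOutput = Seq(Attr("ID", "bigint"), Attr("name", "string")))
assert(projected == Seq("CAST(ID AS int)", "name"))
println(projected)
```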
[GitHub] spark issue #16269: [SPARK-19080][SQL] simplify data source analysis
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16269 **[Test build #71188 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71188/testReport)** for PR 16269 at commit [`87be209`](https://github.com/apache/spark/commit/87be2096d43ae4c9083c2e262c0816bca03dda32).
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95519850

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2476,4 +2476,14 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     assert(sql("SELECT * FROM array_tbl where arr = ARRAY(1L)").count == 1)
   }
 }
+
+  test("should be able to resolve a persistent view") {
--- End diff --

We don't define the behavior of resolving a view using a SQLContext in the current master; this test case is to define that behavior.
[GitHub] spark issue #16529: [SPARK-19149] [SQL] Unify two sets of statistics in Logi...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16529 Actually the multi-threaded issue probably doesn't matter. I will just change it to the original Option implementation.
[GitHub] spark pull request #16529: [SPARK-19149] [SQL] Unify two sets of statistics ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16529
[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...
Github user imatiach-msft commented on the issue: https://github.com/apache/spark/pull/16441 thank you @sethah, I've updated the PR based on your latest comments. @jkbradley would you be able to take a look when you have time? Thank you!
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95519288

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2476,4 +2476,14 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     assert(sql("SELECT * FROM array_tbl where arr = ARRAY(1L)").count == 1)
   }
 }
+
+  test("should be able to resolve a persistent view") {
--- End diff --

What is your goal for this test case? Any reason? BTW, we should move this test case to `SQLQueryTestSuite`; we are now trying to migrate such test cases there.
[GitHub] spark issue #16529: [SPARK-19149] [SQL] Unify two sets of statistics in Logi...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16529 Merging in master. I will fix the thread-local thing in a PR.
[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16441 **[Test build #71187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71187/testReport)** for PR 16441 at commit [`a6ef62e`](https://github.com/apache/spark/commit/a6ef62e65f76042d4ddfc0726125df12548efbaf).
[GitHub] spark issue #16510: [SPARK-19130][SPARKR] Support setting literal value as c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16510 Merged build finished. Test PASSed.
[GitHub] spark issue #16510: [SPARK-19130][SPARKR] Support setting literal value as c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16510 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71183/
[GitHub] spark issue #16510: [SPARK-19130][SPARKR] Support setting literal value as c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16510 **[Test build #71183 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71183/testReport)** for PR 16510 at commit [`b10ae77`](https://github.com/apache/spark/commit/b10ae7705bb91f6885b6c551b8e572cf72b34970).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #16529: [SPARK-19149] [SQL] Unify two sets of statistics ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16529#discussion_r95518724

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
@@ -81,44 +81,36 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
     }
   }
+
+  /** A cache for the estimated statistics, such that it will only be computed once. */
+  private val statsCache = new ThreadLocal[Option[Statistics]] {
--- End diff --

OK, I was thinking you could just use an AtomicReference.
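The trade-off being discussed: a `ThreadLocal` cache recomputes the statistics once per thread, while an `AtomicReference` computes them once per plan instance and makes the result visible to every thread. A minimal sketch of the `AtomicReference` approach (with a hypothetical `Statistics` and `PlanNode`, not Catalyst's actual classes) might look like this:

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for Catalyst's Statistics class; illustrative only.
case class Statistics(sizeInBytes: BigInt)

class PlanNode {
  // Computed at most once per plan instance and visible to all threads,
  // unlike a ThreadLocal, which would recompute per thread.
  private val statsCache = new AtomicReference[Statistics](null)

  // Counts how many times the (notionally expensive) estimate ran.
  var computeCount = 0

  private def computeStats(): Statistics = {
    computeCount += 1
    Statistics(sizeInBytes = BigInt(1024))
  }

  def stats: Statistics = {
    val cached = statsCache.get()
    if (cached != null) cached
    else {
      // Benign race: two threads may both compute, but only one value wins
      // and all callers observe the same cached instance afterwards.
      statsCache.compareAndSet(null, computeStats())
      statsCache.get()
    }
  }

  // Plan rewrites that change cardinality would reset the cache.
  def invalidateStatsCache(): Unit = statsCache.set(null)
}

object StatsCacheDemo {
  def main(args: Array[String]): Unit = {
    val plan = new PlanNode
    plan.stats
    plan.stats
    println(s"computed ${plan.computeCount} time(s)")
  }
}
```

This is the "compute once, share everywhere" semantics the review comment is asking for; the plain `Option[Statistics]` field mentioned in the issue thread gives the same single-computation behavior when thread safety is not a concern.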
[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...
Github user imatiach-msft commented on the issue: https://github.com/apache/spark/pull/16441 With regards to the loss type, I think the real issue is that the user shouldn't be able to change the loss type on the model at all, as with many other parameters. It seems strange for the model and the trainer to share the same parameters in that case. You are probably correct that users will never change the loss on the model and expect the probability function to change, but the fact that they can, and that doing so has no effect, bothers me; it's not a significant issue, though.
[GitHub] spark issue #16249: [SPARK-18828][SPARKR] Refactor scripts for R
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16249 **[Test build #71186 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71186/testReport)** for PR 16249 at commit [`66fc83c`](https://github.com/apache/spark/commit/66fc83cb349f5cd8a34ed3d272b8e0ab7b0fe423).
[GitHub] spark issue #16249: [SPARK-18828][SPARKR] Refactor scripts for R
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16249 Jenkins, retest this please
[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95518300

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
                list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

How about adding this to the doc: "the actual number of partitions can be increased in multiples of spark.r.maxAllocationLimit, or limited by the number of columns in the data.frame."?
[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95518069

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -66,10 +72,157 @@ class GBTClassifierSuite extends SparkFunSuite with MLlibTestSparkContext
     ParamsSuite.checkParams(new GBTClassifier)
     val model = new GBTClassificationModel("gbtc",
       Array(new DecisionTreeRegressionModel("dtr", new LeafNode(0.0, 0.0, null), 1)),
-      Array(1.0), 1)
+      Array(1.0), 1, 2)
     ParamsSuite.checkParams(model)
   }
+
+  test("GBTClassifier: default params") {
+    val gbt = new GBTClassifier
+    assert(gbt.getLabelCol === "label")
+    assert(gbt.getFeaturesCol === "features")
+    assert(gbt.getPredictionCol === "prediction")
+    assert(gbt.getRawPredictionCol === "rawPrediction")
+    assert(gbt.getProbabilityCol === "probability")
+    val df = trainData.toDF()
+    val model = gbt.fit(df)
+    model.transform(df)
+      .select("label", "probability", "prediction", "rawPrediction")
+      .collect()
+    intercept[NoSuchElementException] {
+      model.getThresholds
+    }
+    assert(model.getFeaturesCol === "features")
+    assert(model.getPredictionCol === "prediction")
+    assert(model.getRawPredictionCol === "rawPrediction")
+    assert(model.getProbabilityCol === "probability")
+    assert(model.hasParent)
+
+    // copied model must have the same parent.
+    MLTestingUtils.checkCopy(model)
+  }
+
+  test("setThreshold, getThreshold") {
+    val gbt = new GBTClassifier
+
+    // default
+    withClue("GBTClassifier should not have thresholds set by default.") {
+      intercept[NoSuchElementException] {
+        gbt.getThresholds
+      }
+    }
+
+    // Set via thresholds
+    val gbt2 = new GBTClassifier
+    val threshold = Array(0.3, 0.7)
+    gbt2.setThresholds(threshold)
+    assert(gbt2.getThresholds.zipWithIndex.forall(valueWithIndex =>
--- End diff --

done
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95517753

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala ---
@@ -543,4 +545,157 @@ class SQLViewSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
       }
     }
   }
+
+  test("correctly resolve a nested view") {
+    withTempDatabase { db =>
+      withView(s"$db.view1", s"$db.view2") {
+        val view1 = CatalogTable(
+          identifier = TableIdentifier("view1", Some(db)),
+          tableType = CatalogTableType.VIEW,
+          storage = CatalogStorageFormat.empty,
+          schema = new StructType().add("id", "int").add("id1", "int"),
+          viewOriginalText = Some("SELECT * FROM jt"),
+          viewText = Some("SELECT * FROM jt"),
+          properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+        val view2 = CatalogTable(
+          identifier = TableIdentifier("view2", Some(db)),
+          tableType = CatalogTableType.VIEW,
+          storage = CatalogStorageFormat.empty,
+          schema = new StructType().add("id", "int").add("id1", "int"),
+          viewOriginalText = Some("SELECT * FROM view1"),
+          viewText = Some("SELECT * FROM view1"),
+          properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> db})
+        activateDatabase(db) {
+          hiveContext.sessionState.catalog.createTable(view1, ignoreIfExists = false)
+          hiveContext.sessionState.catalog.createTable(view2, ignoreIfExists = false)
+          checkAnswer(sql("SELECT * FROM view2 ORDER BY id"), (1 to 9).map(i => Row(i, i)))
+        }
+      }
+    }
+  }
+
+  test("correctly resolve a view with CTE") {
+    withView("cte_view") {
+      val cte_view = CatalogTable(
+        identifier = TableIdentifier("cte_view"),
+        tableType = CatalogTableType.VIEW,
+        storage = CatalogStorageFormat.empty,
+        schema = new StructType().add("n", "int"),
+        viewOriginalText = Some("WITH w AS (SELECT 1 AS n) SELECT n FROM w"),
+        viewText = Some("WITH w AS (SELECT 1 AS n) SELECT n FROM w"),
+        properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+      hiveContext.sessionState.catalog.createTable(cte_view, ignoreIfExists = false)
+      checkAnswer(sql("SELECT * FROM cte_view"), Row(1))
+    }
+  }
+
+  test("correctly resolve a view in a self join") {
+    withView("join_view") {
+      val join_view = CatalogTable(
+        identifier = TableIdentifier("join_view"),
+        tableType = CatalogTableType.VIEW,
+        storage = CatalogStorageFormat.empty,
+        schema = new StructType().add("id", "int").add("id1", "int"),
+        viewOriginalText = Some("SELECT * FROM jt"),
+        viewText = Some("SELECT * FROM jt"),
+        properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+      hiveContext.sessionState.catalog.createTable(join_view, ignoreIfExists = false)
+      checkAnswer(
+        sql("SELECT * FROM join_view t1 JOIN join_view t2 ON t1.id = t2.id ORDER BY t1.id"),
+        (1 to 9).map(i => Row(i, i, i, i)))
+    }
+  }
+
+  private def assertInvalidReference(query: String): Unit = {
+    val e = intercept[AnalysisException] {
+      sql(query)
+    }.getMessage
+    assert(e.contains("Table or view not found"))
+  }
+
+  test("error handling: fail if the referenced table or view is invalid") {
+    withView("view1", "view2", "view3") {
+      // Fail if the referenced table is defined in an invalid database.
+      val view1 = CatalogTable(
+        identifier = TableIdentifier("view1"),
+        tableType = CatalogTableType.VIEW,
+        storage = CatalogStorageFormat.empty,
+        schema = new StructType().add("id", "int").add("id1", "int"),
+        viewOriginalText = Some("SELECT * FROM invalid_db.jt"),
+        viewText = Some("SELECT * FROM invalid_db.jt"),
+        properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+      hiveContext.sessionState.catalog.createTable(view1, ignoreIfExists = false)
+      assertInvalidReference("SELECT * FROM view1")
+
+      // Fail if the referenced table is invalid.
+      val view2 = CatalogTable(
+        identifier = TableIdentifier("view2"),
+        tableType = CatalogTableType.VIEW,
+        storage = CatalogStorageFormat.empty,
+        schema = new StructType().add("id", "int").add("id1", "int"),
+        viewOriginalText = Some("SELECT * FROM invalid_table"),
+        viewText =
[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95517754

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala ---
@@ -20,6 +20,12 @@ package org.apache.spark.mllib.tree.loss
 import org.apache.spark.annotation.{DeveloperApi, Since}
 import org.apache.spark.mllib.util.MLUtils
+
+/**
+ * Trait for adding probability function for the gradient boosting algorithm.
--- End diff --

Moved to Loss.scala. Removed the doc on the class. Added doc for the computeProbability method.
[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15505 **[Test build #71185 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71185/testReport)** for PR 15505 at commit [`4ab93ee`](https://github.com/apache/spark/commit/4ab93eec0ccdab1b1141fc3eb0996fa4dbaf92d8).
[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95517520

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -275,18 +316,33 @@ class GBTClassificationModel private[ml](
   @Since("2.0.0")
   lazy val featureImportances: Vector = TreeEnsembleModel.featureImportances(trees, numFeatures)
+
+  private def margin(features: Vector): Double = {
+    val treePredictions = _trees.map(_.rootNode.predictImpl(features).prediction)
+    blas.ddot(numTrees, treePredictions, 1, _treeWeights, 1)
+  }
+
   /** (private[ml]) Convert to a model in the old API */
   private[ml] def toOld: OldGBTModel = {
     new OldGBTModel(OldAlgo.Classification, _trees.map(_.toOld), _treeWeights)
   }
+
+  /**
+   * Note: this is currently an optimization that should be removed when we have more loss
+   * functions available than only logistic.
+   */
+  private lazy val loss = getOldLossType
--- End diff --

Removed lazy, removed the comment. I made it lazy so as not to do the lookup unless it needs to be done, but since the lookup isn't actually expensive and the laziness only seemed to confuse, it's better to remove it.
[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95517316

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -275,18 +316,33 @@ class GBTClassificationModel private[ml](
   @Since("2.0.0")
   lazy val featureImportances: Vector = TreeEnsembleModel.featureImportances(trees, numFeatures)
+
+  private def margin(features: Vector): Double = {
--- End diff --

done
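The `margin` method under review is the tree-weight-weighted sum of the per-tree predictions (the `blas.ddot` call in the diff); for log loss, the probability of the positive class is then the logistic of `2 * margin`, matching MLlib's `LogLoss` formulation with labels in {-1, 1}. A small sketch of that computation with plain arrays (names here are illustrative, not the actual Spark internals):

```scala
object GbtProbabilitySketch {
  // Weighted sum of per-tree predictions; equivalent to the blas.ddot call
  // in the diff above, written out with plain collections.
  def margin(treePredictions: Array[Double], treeWeights: Array[Double]): Double = {
    require(treePredictions.length == treeWeights.length,
      "one weight per tree is required")
    treePredictions.zip(treeWeights).map { case (p, w) => p * w }.sum
  }

  // Logistic transform used by log loss over {-1, 1} labels:
  // P(y = 1 | x) = 1 / (1 + exp(-2 * margin)).
  def probabilityOfPositive(m: Double): Double =
    1.0 / (1.0 + math.exp(-2.0 * m))

  def main(args: Array[String]): Unit = {
    // Three trees voting +1/-1 with weights 0.5, 0.25, 0.25.
    val m = margin(Array(1.0, -1.0, 1.0), Array(0.5, 0.25, 0.25))
    println(s"margin = $m, p(+) = ${probabilityOfPositive(m)}")
  }
}
```

A zero margin maps to probability 0.5, and larger positive margins saturate toward 1.0, which is why thresholding the probability at 0.5 reproduces the sign-of-margin prediction.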
[GitHub] spark issue #16543: [SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm for Gamm...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16543 **[Test build #71184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71184/consoleFull)** for PR 16543 at commit [`b7f934a`](https://github.com/apache/spark/commit/b7f934ad2eb3f39125d9bc29289e8ce3a49f48b7).
[GitHub] spark issue #16473: [SPARK-19069] [CORE] Expose task 'status' and 'duration'...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16473 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71176/
[GitHub] spark pull request #16543: [SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm f...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/16543 [SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm for Gamma, clarify glm family supported

## What changes were proposed in this pull request?

Backport to 2.0 (cherry-picking from 2.1 didn't work).

## How was this patch tested?

unit test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rgammabackport20

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16543.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16543

commit b7f934ad2eb3f39125d9bc29289e8ce3a49f48b7
Author: Felix Cheung
Date: 2017-01-11T06:02:44Z

    fix Gamma family
[GitHub] spark issue #16473: [SPARK-19069] [CORE] Expose task 'status' and 'duration'...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16473 Merged build finished. Test PASSed.
[GitHub] spark issue #16473: [SPARK-19069] [CORE] Expose task 'status' and 'duration'...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16473 **[Test build #71176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71176/testReport)** for PR 16473 at commit [`0df715d`](https://github.com/apache/spark/commit/0df715da04fb8349d8f2c1040c76fc92e1e7ad83).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95515867

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala ---
@@ -543,4 +545,157 @@ class SQLViewSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
       }
     }
   }
+
+  test("correctly resolve a nested view") {
+    withTempDatabase { db =>
+      withView(s"$db.view1", s"$db.view2") {
+        val view1 = CatalogTable(
+          identifier = TableIdentifier("view1", Some(db)),
+          tableType = CatalogTableType.VIEW,
+          storage = CatalogStorageFormat.empty,
+          schema = new StructType().add("id", "int").add("id1", "int"),
+          viewOriginalText = Some("SELECT * FROM jt"),
+          viewText = Some("SELECT * FROM jt"),
+          properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+        val view2 = CatalogTable(
+          identifier = TableIdentifier("view2", Some(db)),
+          tableType = CatalogTableType.VIEW,
+          storage = CatalogStorageFormat.empty,
+          schema = new StructType().add("id", "int").add("id1", "int"),
+          viewOriginalText = Some("SELECT * FROM view1"),
+          viewText = Some("SELECT * FROM view1"),
+          properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> db})
+        activateDatabase(db) {
+          hiveContext.sessionState.catalog.createTable(view1, ignoreIfExists = false)
+          hiveContext.sessionState.catalog.createTable(view2, ignoreIfExists = false)
+          checkAnswer(sql("SELECT * FROM view2 ORDER BY id"), (1 to 9).map(i => Row(i, i)))
+        }
+      }
+    }
+  }
+
+  test("correctly resolve a view with CTE") {
+    withView("cte_view") {
+      val cte_view = CatalogTable(
+        identifier = TableIdentifier("cte_view"),
+        tableType = CatalogTableType.VIEW,
+        storage = CatalogStorageFormat.empty,
+        schema = new StructType().add("n", "int"),
+        viewOriginalText = Some("WITH w AS (SELECT 1 AS n) SELECT n FROM w"),
+        viewText = Some("WITH w AS (SELECT 1 AS n) SELECT n FROM w"),
+        properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+      hiveContext.sessionState.catalog.createTable(cte_view, ignoreIfExists = false)
+      checkAnswer(sql("SELECT * FROM cte_view"), Row(1))
+    }
+  }
+
+  test("correctly resolve a view in a self join") {
+    withView("join_view") {
+      val join_view = CatalogTable(
+        identifier = TableIdentifier("join_view"),
+        tableType = CatalogTableType.VIEW,
+        storage = CatalogStorageFormat.empty,
+        schema = new StructType().add("id", "int").add("id1", "int"),
+        viewOriginalText = Some("SELECT * FROM jt"),
+        viewText = Some("SELECT * FROM jt"),
+        properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+      hiveContext.sessionState.catalog.createTable(join_view, ignoreIfExists = false)
+      checkAnswer(
+        sql("SELECT * FROM join_view t1 JOIN join_view t2 ON t1.id = t2.id ORDER BY t1.id"),
+        (1 to 9).map(i => Row(i, i, i, i)))
+    }
+  }
+
+  private def assertInvalidReference(query: String): Unit = {
+    val e = intercept[AnalysisException] {
+      sql(query)
+    }.getMessage
+    assert(e.contains("Table or view not found"))
+  }
+
+  test("error handling: fail if the referenced table or view is invalid") {
+    withView("view1", "view2", "view3") {
+      // Fail if the referenced table is defined in an invalid database.
+      val view1 = CatalogTable(
+        identifier = TableIdentifier("view1"),
+        tableType = CatalogTableType.VIEW,
+        storage = CatalogStorageFormat.empty,
+        schema = new StructType().add("id", "int").add("id1", "int"),
+        viewOriginalText = Some("SELECT * FROM invalid_db.jt"),
+        viewText = Some("SELECT * FROM invalid_db.jt"),
+        properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+      hiveContext.sessionState.catalog.createTable(view1, ignoreIfExists = false)
+      assertInvalidReference("SELECT * FROM view1")
+
+      // Fail if the referenced table is invalid.
+      val view2 = CatalogTable(
+        identifier = TableIdentifier("view2"),
+        tableType = CatalogTableType.VIEW,
+        storage = CatalogStorageFormat.empty,
+        schema = new StructType().add("id", "int").add("id1", "int"),
+        viewOriginalText = Some("SELECT * FROM invalid_table"),
+        viewText =
[GitHub] spark issue #4775: [SPARK-6016][SQL] Cannot read the parquet table after ove...
Github user karthikgolagani commented on the issue: https://github.com/apache/spark/pull/4775 @liancheng Hi lian, if you are using sparkContext (sc), you can set "parquet.enable.summary-metadata" to "false" like below: sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false"). This fixed my issue instantly. I did it in my Spark Streaming application. > WARN ParquetOutputCommitter: could not write summary file for hdfs://localhost/user/hive/warehouse java.lang.NullPointerException --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
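The commenter's snippet elides the method name. On a plain `SparkContext` the property is usually set through the underlying Hadoop configuration; a minimal sketch (assuming a live `SparkContext` named `sc`, e.g. in spark-shell):

```scala
// Sketch: disable Parquet summary-metadata files to avoid the
// NullPointerException in ParquetOutputCommitter when it tries to
// write the _metadata summary file. Assumes `sc` is a live SparkContext.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
```

The same property can also be passed per-job through the Hadoop configuration of a write, but setting it once on the context is the simplest fix for a streaming application like the commenter's.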
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95515728 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the batch of Resolution, because the view attributes are + * not completely resolved during the batch of Resolution. 
+ */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = output.map { attr => +val originAttr = findAttributeByName(attr.name, child.output, resolver) +// The dataType of the output attributes may be not the same with that of the view output, +// so we should cast the attribute to the dataType of the view output attribute. If the +// cast cann't perform, will throw an AnalysisException. --- End diff -- Nit: `cann't ` -> `can't `
[GitHub] spark pull request #16510: [SPARK-19130][SPARKR] Support setting literal val...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16510#discussion_r95515567 --- Diff: R/pkg/R/DataFrame.R --- @@ -1727,14 +1727,21 @@ setMethod("$", signature(x = "SparkDataFrame"), getColumn(x, name) }) -#' @param value a Column or \code{NULL}. If \code{NULL}, the specified Column is dropped. +#' @param value a Column or an atomic vector in the length of 1 as literal value, or \code{NULL}. +#' If \code{NULL}, the specified Column is dropped. #' @rdname select #' @name $<- #' @aliases $<-,SparkDataFrame-method #' @note $<- since 1.4.0 setMethod("$<-", signature(x = "SparkDataFrame"), function(x, name, value) { -stopifnot(class(value) == "Column" || is.null(value)) +if (class(value) != "Column" && !is.null(value)) { + if (isAtomicLengthOne(value)) { +value <- lit(value) + } else { +stop("value must be a Column, Literal value as atomic in length of 1, or NULL") --- End diff -- I was thinking org.apache.spark.sql.catalyst.expressions.Literal as the type, but I guess I can use the lower case term too
[GitHub] spark issue #16510: [SPARK-19130][SPARKR] Support setting literal value as c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16510 **[Test build #71183 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71183/testReport)** for PR 16510 at commit [`b10ae77`](https://github.com/apache/spark/commit/b10ae7705bb91f6885b6c551b8e572cf72b34970).
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95515250 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +545,87 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Otherwise we look up the table from +// the catalog and change the default database name (in AnalysisContext) if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part; for a view the default database should be the currentDb when the view was +// created. When it comes to resolving a nested view, the view may have a different default +// database than the referenced view has, so we need to use +// `AnalysisContext.defaultDatabase` to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc (which is a CatalogTable), +// set `AnalysisContext.defaultDatabase` to the value of `CatalogTable.viewDefaultDatabase`, +// and look up the relations that the view references using that default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +// |- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view; it directly references `table2`, `view2` +// and `view4`, and the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`, `view2`, `view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`.
+// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation(plan: LogicalPlan): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +// Resolve all the UnresolvedRelations and Views in the child. +val newChild = AnalysisContext.withAnalysisContext(desc.viewDefaultDatabase) { + execute(child) +} +view.copy(child = newChild) + case p @ SubqueryAlias(_, view: View, _) => +val newChild = resolveRelation(view) +p.copy(child = newChild) + case _ => plan +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case u: UnresolvedRelation => resolveRelation(u) +} + +// Look up the table with the given name from catalog. The database we used is decided by the +// precedence: +// 1. Use the database part of the table identifier, if it is defined; +// 2. Use defaultDatabase, if it is defined(In this case, no temporary objects can be used, +//and the default database is only used to look up a view); +// 3. Use the currentDb of the SessionCatalog. 
+private def lookupTableFromCatalog( +u: UnresolvedRelation, +defaultDatabase: Option[String] = None): LogicalPlan = { try { -catalog.lookupRelation(u.tableIdentifier, u.alias) +val tableIdentWithDb = u.tableIdentifier.copy( + database = u.tableIdentifier.database.orElse(defaultDatabase)) +catalog.lookupRelation(tableIdentWithDb, u.alias) } catch { case _: NoSuchTableException => u.failAnalysis(s"Table or view not found: ${u.tableName}") } } -def apply(plan: LogicalPlan):
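The three-step database precedence described above can be sketched in a few lines of plain Scala (a simplified model with illustrative names, not Spark's actual API):

```scala
// Simplified model of the lookup precedence:
// 1. the database part of the table identifier, if defined;
// 2. the view's default database, if defined;
// 3. the session's current database.
def resolveDatabase(
    identifierDb: Option[String],
    defaultDatabase: Option[String],
    currentDb: String): String =
  identifierDb.orElse(defaultDatabase).getOrElse(currentDb)
```

This mirrors the `u.tableIdentifier.database.orElse(defaultDatabase)` call in the diff, with the catalog supplying the current database as the final fallback.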
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95514996 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala --- @@ -860,6 +864,24 @@ abstract class CatalogTestUtils { bucketSpec = Some(BucketSpec(4, Seq("col1"), Nil))) } + def newView( + name: String, + database: Option[String] = None): CatalogTable = { +val viewDefaultDatabase = database.getOrElse("default") +CatalogTable( + identifier = TableIdentifier(name, database), + tableType = CatalogTableType.VIEW, + storage = CatalogStorageFormat.empty, + schema = new StructType() +.add("col1", "int") +.add("col2", "string") +.add("a", "int") +.add("b", "string"), + viewOriginalText = Some("SELECT * FROM tbl1"), + viewText = Some("SELECT * FROM tbl1"), + properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> viewDefaultDatabase}) --- End diff -- nit: `Map(CatalogTable.VIEW_DEFAULT_DATABASE -> viewDefaultDatabase)`, the Scala compiler will infer the type for us
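To illustrate the nit: both forms below build the same one-entry map, and the second lets the compiler infer `Map[String, String]` from the pair (the key here is an illustrative string standing in for `CatalogTable.VIEW_DEFAULT_DATABASE`):

```scala
// Explicit type arguments plus a block argument, as in the diff under review:
val explicit = Map[String, String] { "view.default.database" -> "default" }

// Equivalent, with the element type inferred from the key/value pair:
val inferred = Map("view.default.database" -> "default")
```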
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95514898 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala --- @@ -378,6 +379,35 @@ case class InsertIntoTable( } /** + * A container for holding the view description(CatalogTable), and the output of the view. The + * child should be a logical plan parsed from the `CatalogTable.viewText`, should throw an error + * if the `viewText` is not defined. + * This operator will be removed at the end of analysis stage. + * + * @param desc A view description(CatalogTable) that provides necessary information to resolve the + * view. + * @param output The output of a view operator, this is generated during planning the view, so that + * we are able to decouple the output from the underlying structure. + * @param child The logical plan of a view operator, it should be a logical plan parsed from the + * `CatalogTable.viewText`, should throw an error if the `viewText` is not defined. + */ +case class View( +desc: CatalogTable, +output: Seq[Attribute], +child: LogicalPlan) extends LogicalPlan with MultiInstanceRelation { --- End diff -- nit: extends `UnaryNode`?
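The `UnaryNode` suggestion in miniature: an operator with exactly one child that extends a UnaryNode-style trait gets `children` for free instead of redefining it. A toy sketch with illustrative types, not Spark's actual `TreeNode` hierarchy:

```scala
// A minimal tree-node hierarchy.
trait Node { def children: Seq[Node] }

// Single-child nodes only declare `child`; `children` is derived.
trait UnaryNode extends Node {
  def child: Node
  override def children: Seq[Node] = child :: Nil
}

case class Leaf(name: String) extends Node {
  def children: Seq[Node] = Nil
}

// The View-like operator wraps one child plan.
case class ViewNode(name: String, child: Node) extends UnaryNode
```

Generic traversals (resolution rules, transforms) then see `ViewNode` through the same `children` interface as every other single-child operator.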
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95514867 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the batch of Resolution, because the view attributes are + * not completely resolved during the batch of Resolution. 
+ */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = output.map { attr => +val originAttr = findAttributeByName(attr.name, child.output, resolver) +// The dataType of the output attributes may be not the same with that of the view output, +// so we should cast the attribute to the dataType of the view output attribute. If the +// cast cann't perform, will throw an AnalysisException. +Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = attr.exprId, + qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { --- End diff -- nit: use `find`?
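The `find` nit illustrated: when the partial function passed to `collectFirst` is just a guarded identity, `find` with a predicate expresses the same lookup more directly (strings stand in for attributes here):

```scala
val attrs = Seq("id", "name", "age")

// collectFirst with a guard that returns the matched element unchanged...
val viaCollectFirst = attrs.collectFirst { case a if a == "name" => a }

// ...is equivalent to find with a plain predicate.
val viaFind = attrs.find(_ == "name")
```

`collectFirst` earns its keep only when the match also transforms the element; for a pure existence-and-return lookup, `find` is the idiomatic choice.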
[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95514830 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", { expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float"))) expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5)) + expect_equal(getNumPartitions(toRDD(df)), 1) --- End diff -- it looks like it is intentional: https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L115
[GitHub] spark pull request #16531: [SPARK-19157][SQL] should be able to change spark...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16531
[GitHub] spark issue #16516: [SPARK-19155][ML] Make some string params of ML algorith...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16516 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71178/ Test PASSed.