[GitHub] spark issue #16547: [SPARK-19168][Structured Streaming] Improvement: filter ...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16547
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16547: [SPARK-19168][Structured Streaming] Improvement: filter ...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16547
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71193/
Test FAILed.





[GitHub] spark issue #16547: [SPARK-19168][Structured Streaming] Improvement: filter ...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16547
  
**[Test build #71193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71193/testReport)** for PR 16547 at commit [`c9f62c1`](https://github.com/apache/spark/commit/c9f62c161ad94af35ad237673c85428ce6094ac5).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16547: [SPARK-19168][Structured Streaming] Improvement: filter ...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16547
  
**[Test build #71193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71193/testReport)** for PR 16547 at commit [`c9f62c1`](https://github.com/apache/spark/commit/c9f62c161ad94af35ad237673c85428ce6094ac5).





[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95526238
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
@@ -378,6 +379,35 @@ case class InsertIntoTable(
 }
 
 /**
+ * A container for holding the view description(CatalogTable), and the output of the view. The
+ * child should be a logical plan parsed from the `CatalogTable.viewText`, should throw an error
+ * if the `viewText` is not defined.
+ * This operator will be removed at the end of analysis stage.
+ *
+ * @param desc A view description(CatalogTable) that provides necessary information to resolve the
+ * view.
+ * @param output The output of a view operator, this is generated during planning the view, so that
+ *   we are able to decouple the output from the underlying structure.
+ * @param child The logical plan of a view operator, it should be a logical plan parsed from the
+ *  `CatalogTable.viewText`, should throw an error if the `viewText` is not defined.
+ */
+case class View(
+    desc: CatalogTable,
+    output: Seq[Attribute],
+    child: LogicalPlan) extends LogicalPlan with MultiInstanceRelation {
--- End diff --

We only extend `MultiInstanceRelation` for leaf nodes. Is there any reason why it is needed here?





[GitHub] spark pull request #16547: [SPARK-19168][Structured Streaming] Improvement: ...

2017-01-10 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/16547

[SPARK-19168][Structured Streaming] Improvement: filter late data using watermark for `Append` mode

## What changes were proposed in this pull request?

Currently we're filtering late data using the watermark only for `Update` mode; maybe we should do the same for `Append` mode.

Note this is an improvement rather than a correctness fix, because the current behavior of `Append` mode is already correct even without this change.
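
For context, here is a minimal, hypothetical sketch (not part of this patch) of the kind of `Append`-mode query this affects; the socket source, host/port, and column names are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

// Illustrative sketch only: a streaming aggregation in Append mode with a watermark.
object WatermarkAppendSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("watermark-append-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical socket source emitting lines like "2017-01-10 12:00:00,word"
    val events = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]
      .map { line =>
        val Array(ts, word) = line.split(",")
        (java.sql.Timestamp.valueOf(ts), word)
      }
      .toDF("eventTime", "word")

    // With a 10-minute watermark, Append mode only emits windows once they are
    // finalized; the improvement proposed here is to also drop input rows that
    // are already older than the watermark before they reach the aggregation.
    val counts = events
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window($"eventTime", "5 minutes"), $"word")
      .count()

    counts.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```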

## How was this patch tested?

Commit #1 of this patch adds `numRowsUpdated` checks in `EventTimeWatermarkSuite.scala`:

```scala
line 139:  AddData(inputData, 10),
line 140:  CheckLastBatch(),
line 141:  assertNumStateRows(2, 1)  // We also processed the data 10, which is less than the watermark
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark append-filter

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16547.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16547


commit 86d088687b7da7f33022ee8693cc3fdd9228775b
Author: Liwei Lin 
Date:   2017-01-11T07:15:43Z

Also examine `numRowsUpdated` in test

commit 2a91e6f8612c01b61a4d501b22fbf2690fa36f4a
Author: Liwei Lin 
Date:   2017-01-11T07:21:33Z

Filter data less than watermark in `Append` mode

commit c9f62c161ad94af35ad237673c85428ce6094ac5
Author: Liwei Lin 
Date:   2017-01-11T07:38:02Z

Fix an error message







[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95525588
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala ---
@@ -543,4 +545,157 @@ class SQLViewSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
   }
 }
   }
+
+  test("correctly resolve a nested view") {
+    withTempDatabase { db =>
+      withView(s"$db.view1", s"$db.view2") {
+        val view1 = CatalogTable(
+          identifier = TableIdentifier("view1", Some(db)),
+          tableType = CatalogTableType.VIEW,
+          storage = CatalogStorageFormat.empty,
+          schema = new StructType().add("id", "int").add("id1", "int"),
+          viewOriginalText = Some("SELECT * FROM jt"),
+          viewText = Some("SELECT * FROM jt"),
+          properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+        val view2 = CatalogTable(
+          identifier = TableIdentifier("view2", Some(db)),
+          tableType = CatalogTableType.VIEW,
+          storage = CatalogStorageFormat.empty,
+          schema = new StructType().add("id", "int").add("id1", "int"),
+          viewOriginalText = Some("SELECT * FROM view1"),
+          viewText = Some("SELECT * FROM view1"),
+          properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> db})
+        activateDatabase(db) {
+          hiveContext.sessionState.catalog.createTable(view1, ignoreIfExists = false)
+          hiveContext.sessionState.catalog.createTable(view2, ignoreIfExists = false)
+          checkAnswer(sql("SELECT * FROM view2 ORDER BY id"), (1 to 9).map(i => Row(i, i)))
+        }
+      }
+    }
+  }
+
+  test("correctly resolve a view with CTE") {
+    withView("cte_view") {
+      val cte_view = CatalogTable(
+        identifier = TableIdentifier("cte_view"),
+        tableType = CatalogTableType.VIEW,
+        storage = CatalogStorageFormat.empty,
+        schema = new StructType().add("n", "int"),
+        viewOriginalText = Some("WITH w AS (SELECT 1 AS n) SELECT n FROM w"),
+        viewText = Some("WITH w AS (SELECT 1 AS n) SELECT n FROM w"),
+        properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+      hiveContext.sessionState.catalog.createTable(cte_view, ignoreIfExists = false)
+      checkAnswer(sql("SELECT * FROM cte_view"), Row(1))
+    }
+  }
+
+  test("correctly resolve a view in a self join") {
--- End diff --

Without `View` extending `MultiInstanceRelation`, it still works.





[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95525456
  

I still don't see why we need to extend `MultiInstanceRelation` here. We only do that for leaf nodes, right?





[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16233
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71181/
Test PASSed.





[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16441
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16233
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16546: [WIP][SQL] Put check in ExpressionEncoder.fromRow to ens...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16546
  
**[Test build #71192 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71192/testReport)** for PR 16546 at commit [`190fb62`](https://github.com/apache/spark/commit/190fb6222d84991000f91735952579b1e0686a61).





[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16441
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71187/
Test PASSed.





[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16233
  
**[Test build #71181 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71181/testReport)** for PR 16233 at commit [`ff4b35f`](https://github.com/apache/spark/commit/ff4b35fe40b5e12f1a39c5129ec9a702d593457a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16441
  
**[Test build #71187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71187/testReport)** for PR 16441 at commit [`a6ef62e`](https://github.com/apache/spark/commit/a6ef62e65f76042d4ddfc0726125df12548efbaf).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16395
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71180/
Test PASSed.





[GitHub] spark issue #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16395
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #16546: [WIP][SQL] Put check in ExpressionEncoder.fromRow...

2017-01-10 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/16546

[WIP][SQL] Put check in ExpressionEncoder.fromRow to ensure we can convert deserialized object to required type

## What changes were proposed in this pull request?

Two problems are addressed in this patch.

1. Serializing a subclass of `Seq[_]` that doesn't have an element type

Currently, in `ScalaReflection.serializerFor`, we try to serialize all subtypes of `Seq[_]`. But for `Range`, which is a `Seq[Int]` without an element type parameter, `serializerFor` fails with a cryptic message:

scala.MatchError: scala.collection.immutable.Range.Inclusive (of class scala.reflect.internal.Types$ClassNoArgsTypeRef)
  at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:520)
  at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:463)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
 

This patch tries to fix this by handling types that lack an element type parameter.
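
As an illustration only (this snippet is not taken from the PR), the following hypothetical case class with a `Range` field is the kind of code whose encoder derivation goes through `ScalaReflection.serializerFor` and can hit the `MatchError` above:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example type: a field whose static type is Range,
// i.e. a Seq[Int] subtype with no element type parameter.
case class RangeHolder(r: Range)

object RangeEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("range-encoder-sketch").getOrCreate()
    import spark.implicits._

    // Deriving an encoder for RangeHolder recurses into the Range field via
    // ScalaReflection.serializerFor, which is where the MatchError is raised
    // before this patch.
    val ds = spark.createDataset(Seq(RangeHolder(1 to 3)))
    ds.show()
  }
}
```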

2. The encoder can't deserialize an internal row to the required type

We serialize objects through a common super class such as `Seq[_]` into a common internal representation. But when we want to deserialize the internal data back to the original objects, we run into the problem of instantiating the right concrete type.

For example, data serialized from a `Seq[_]` is deserialized into a `WrappedArray`. That works when the requested type is `Seq[_]` itself. But if the requested type is a subclass of `Seq[_]` (for example `Range`) that is not assignable from `WrappedArray`, there is a runtime error when converting the deserialized data to that subclass.
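
A minimal plain-Scala sketch of that mismatch (illustrative only, not code from the PR):

```scala
import scala.collection.mutable.WrappedArray

object DeserializationMismatchSketch {
  def main(args: Array[String]): Unit = {
    // The deserializer produces a WrappedArray-backed Seq[Int]...
    val deserialized: Seq[Int] = WrappedArray.make[Int](Array(1, 2, 3))

    // ...but if the requested static type is a Seq subclass such as Range,
    // the conversion fails at runtime with a ClassCastException.
    val asRange = deserialized.asInstanceOf[Range] // throws ClassCastException
    println(asRange)
  }
}
```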

Short of explicitly writing down a deserialization rule for each subclass of `Seq[_]`, I think the feasible solution is to check whether we can convert the deserialized data to the required type. This patch puts that check into `ExpressionEncoder.fromRow`. When the requirement is not met, we show a reasonable error message to users.
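
Conceptually, the check could look something like the following sketch; this is only an illustration of the idea, not the actual code added to `ExpressionEncoder.fromRow` (the helper name and message are made up):

```scala
// Hypothetical helper illustrating the kind of convertibility check described above.
def assertConvertible(deserialized: AnyRef, requiredClass: Class[_]): Unit = {
  if (deserialized != null && !requiredClass.isAssignableFrom(deserialized.getClass)) {
    throw new RuntimeException(
      s"Cannot convert deserialized object of type ${deserialized.getClass.getName} " +
        s"to the required type ${requiredClass.getName}.")
  }
}
```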

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 encoder-range

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16546.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16546


commit 190fb6222d84991000f91735952579b1e0686a61
Author: Liang-Chi Hsieh 
Date:   2017-01-11T03:38:44Z

Put check in ExpressionEncoder.fromRow to ensure we can convert deserialized object to required type.







[GitHub] spark issue #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16395
  
**[Test build #71180 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71180/testReport)** for PR 16395 at commit [`210b11b`](https://github.com/apache/spark/commit/210b11b4aef5673ec0f98ee12d60bc05fd32e44d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95524651
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -0,0 +1,555 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import scala.collection.immutable.{HashSet, Map}
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+
+class FilterEstimation extends Logging {
+
+  /**
+   * We use a mutable colStats because we need to update the corresponding ColumnStat
+   * for a column after we apply a predicate condition.  For example, a column c has
+   * [min, max] value as [0, 100].  In a range condition such as (c > 40 AND c <= 50),
+   * we need to set the column's [min, max] value to [40, 100] after we evaluate the
+   * first condition c > 40.  We need to set the column's [min, max] value to [40, 50]
+   * after we evaluate the second condition c <= 50.
+   */
+  private var mutableColStats: mutable.Map[ExprId, ColumnStat] = mutable.Map.empty
+
+  /**
+   * Returns an option of Statistics for a Filter logical plan node.
+   * For a given compound expression condition, this method computes filter selectivity
+   * (or the percentage of rows meeting the filter condition), which
+   * is used to compute row count, size in bytes, and the updated statistics after a given
+   * predicate is applied.
+   *
+   * @param plan a LogicalPlan node that must be an instance of Filter.
+   * @return Option[Statistics] When there are no statistics collected, it returns None.
+   */
+  def estimate(plan: Filter): Option[Statistics] = {
+    val stats: Statistics = plan.child.statistics
+    if (stats.rowCount.isEmpty) return None
+
+    // save a mutable copy of colStats so that we can later change it recursively
+    val statsExprIdMap: Map[ExprId, ColumnStat] =
+      stats.attributeStats.map(kv => (kv._1.exprId, kv._2))
+    mutableColStats = mutable.Map.empty ++= statsExprIdMap
+
+    // estimate selectivity of this filter predicate
+    val percent: Double = calculateConditions(plan, plan.condition)
+
+    // attributeStats has mapping Attribute-to-ColumnStat.
+    // mutableColStats has mapping ExprId-to-ColumnStat.
+    // We use an ExprId-to-Attribute map to facilitate the mapping Attribute-to-ColumnStat
+    val expridToAttrMap: Map[ExprId, Attribute] =
+      stats.attributeStats.map(kv => (kv._1.exprId, kv._1))
+    // copy mutableColStats contents to an immutable AttributeMap.
+    val mutableAttributeStats: mutable.Map[Attribute, ColumnStat] =
+      mutableColStats.map(kv => expridToAttrMap(kv._1) -> kv._2)
+    val newColStats = AttributeMap(mutableAttributeStats.toSeq)
+
+    val filteredRowCountValue: BigInt =
+      EstimationUtils.ceil(BigDecimal(stats.rowCount.get) * percent)
+    val avgRowSize = BigDecimal(EstimationUtils.getRowSize(plan.output, newColStats))
+    val filteredSizeInBytes: BigInt =
+      EstimationUtils.ceil(BigDecimal(filteredRowCountValue) * avgRowSize)
+
+    Some(stats.copy(sizeInBytes = filteredSizeInBytes, rowCount = Some(filteredRowCountValue),
+      attributeStats = newColStats))
+  }
+
+  /**
+   * Returns a percentage of rows meeting a compound condition in Filter node.
+   * A compound condition is decomposed into multiple single conditions linked with AND, OR, NOT.
+   * For logical AND conditions, we need to update stats after a condition estimation
+   * so that the stats will be more accurate for 

[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95524585
  

[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95524418
  

[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95524337
  

[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95524302
  

[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95524281
  

[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95524128
  

[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95524086
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
 ---
@@ -0,0 +1,555 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import scala.collection.immutable.{HashSet, Map}
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+
+class FilterEstimation extends Logging {
+
+  /**
+   * We use a mutable colStats because we need to update the corresponding 
ColumnStat
+   * for a column after we apply a predicate condition.  For example, a 
column c has
+   * [min, max] value as [0, 100].  In a range condition such as (c > 40 
AND c <= 50),
+   * we need to set the column's [min, max] value to [40, 100] after we 
evaluate the
+   * first condition c > 40.  We need to set the column's [min, max] value 
to [40, 50]
+   * after we evaluate the second condition c <= 50.
+   */
+  private var mutableColStats: mutable.Map[ExprId, ColumnStat] = 
mutable.Map.empty
+
+  /**
+   * Returns an option of Statistics for a Filter logical plan node.
+   * For a given compound expression condition, this method computes 
filter selectivity
+   * (or the percentage of rows meeting the filter condition), which
+   * is used to compute row count, size in bytes, and the updated 
statistics after a given
+   * predicate is applied.
+   *
+   * @param plan a LogicalPlan node that must be an instance of Filter.
+   * @return Option[Statistics] When there is no statistics collected, it 
returns None.
+   */
+  def estimate(plan: Filter): Option[Statistics] = {
+val stats: Statistics = plan.child.statistics
+if (stats.rowCount.isEmpty) return None
+
+// save a mutable copy of colStats so that we can later change it 
recursively
+val statsExprIdMap: Map[ExprId, ColumnStat] =
+  stats.attributeStats.map(kv => (kv._1.exprId, kv._2))
+mutableColStats = mutable.Map.empty ++= statsExprIdMap
+
+// estimate selectivity of this filter predicate
+val percent: Double = calculateConditions(plan, plan.condition)
+
+// attributeStats has mapping Attribute-to-ColumnStat.
+// mutableColStats has mapping ExprId-to-ColumnStat.
+// We use an ExprId-to-Attribute map to facilitate the mapping 
Attribute-to-ColumnStat
+val expridToAttrMap: Map[ExprId, Attribute] =
+  stats.attributeStats.map(kv => (kv._1.exprId, kv._1))
+// copy mutableColStats contents to an immutable AttributeMap.
+val mutableAttributeStats: mutable.Map[Attribute, ColumnStat] =
+  mutableColStats.map(kv => expridToAttrMap(kv._1) -> kv._2)
+val newColStats = AttributeMap(mutableAttributeStats.toSeq)
+
+val filteredRowCountValue: BigInt =
+  EstimationUtils.ceil(BigDecimal(stats.rowCount.get) * percent)
+val avgRowSize = BigDecimal(EstimationUtils.getRowSize(plan.output, 
newColStats))
+val filteredSizeInBytes: BigInt =
+  EstimationUtils.ceil(BigDecimal(filteredRowCountValue) * avgRowSize)
+
+Some(stats.copy(sizeInBytes = filteredSizeInBytes, rowCount = 
Some(filteredRowCountValue),
+  attributeStats = newColStats))
+  }
+
+  /**
+   * Returns a percentage of rows meeting a compound condition in Filter 
node.
+   * A compound condition is decomposed into multiple single conditions 
linked with AND, OR, NOT.
+   * For logical AND conditions, we need to update stats after a condition 
estimation
+   * so that the stats will be more accurate for 

[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95523860
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
 ---
@@ -0,0 +1,555 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import scala.collection.immutable.{HashSet, Map}
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+
+class FilterEstimation extends Logging {
+
+  /**
+   * We use a mutable colStats because we need to update the corresponding 
ColumnStat
+   * for a column after we apply a predicate condition.  For example, a 
column c has
+   * [min, max] value as [0, 100].  In a range condition such as (c > 40 
AND c <= 50),
+   * we need to set the column's [min, max] value to [40, 100] after we 
evaluate the
+   * first condition c > 40.  We need to set the column's [min, max] value 
to [40, 50]
+   * after we evaluate the second condition c <= 50.
+   */
+  private var mutableColStats: mutable.Map[ExprId, ColumnStat] = 
mutable.Map.empty
+
+  /**
+   * Returns an option of Statistics for a Filter logical plan node.
+   * For a given compound expression condition, this method computes 
filter selectivity
+   * (or the percentage of rows meeting the filter condition), which
+   * is used to compute row count, size in bytes, and the updated 
statistics after a given
+   * predicate is applied.
+   *
+   * @param plan a LogicalPlan node that must be an instance of Filter.
+   * @return Option[Statistics] When there is no statistics collected, it 
returns None.
+   */
+  def estimate(plan: Filter): Option[Statistics] = {
+val stats: Statistics = plan.child.statistics
+if (stats.rowCount.isEmpty) return None
+
+// save a mutable copy of colStats so that we can later change it 
recursively
+val statsExprIdMap: Map[ExprId, ColumnStat] =
+  stats.attributeStats.map(kv => (kv._1.exprId, kv._2))
+mutableColStats = mutable.Map.empty ++= statsExprIdMap
+
+// estimate selectivity of this filter predicate
+val percent: Double = calculateConditions(plan, plan.condition)
+
+// attributeStats has mapping Attribute-to-ColumnStat.
+// mutableColStats has mapping ExprId-to-ColumnStat.
+// We use an ExprId-to-Attribute map to facilitate the mapping 
Attribute-to-ColumnStat
+val expridToAttrMap: Map[ExprId, Attribute] =
+  stats.attributeStats.map(kv => (kv._1.exprId, kv._1))
+// copy mutableColStats contents to an immutable AttributeMap.
+val mutableAttributeStats: mutable.Map[Attribute, ColumnStat] =
+  mutableColStats.map(kv => expridToAttrMap(kv._1) -> kv._2)
+val newColStats = AttributeMap(mutableAttributeStats.toSeq)
+
+val filteredRowCountValue: BigInt =
+  EstimationUtils.ceil(BigDecimal(stats.rowCount.get) * percent)
+val avgRowSize = BigDecimal(EstimationUtils.getRowSize(plan.output, 
newColStats))
+val filteredSizeInBytes: BigInt =
+  EstimationUtils.ceil(BigDecimal(filteredRowCountValue) * avgRowSize)
+
+Some(stats.copy(sizeInBytes = filteredSizeInBytes, rowCount = 
Some(filteredRowCountValue),
+  attributeStats = newColStats))
+  }
+
+  /**
+   * Returns a percentage of rows meeting a compound condition in Filter 
node.
+   * A compound condition is decomposed into multiple single conditions 
linked with AND, OR, NOT.
+   * For logical AND conditions, we need to update stats after a condition 
estimation
+   * so that the stats will be more accurate for 
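
The mutableColStats idea described in this doc comment boils down to narrowing a column's [min, max] after each evaluated condition. A simplified stand-alone illustration (plain Scala with a stand-in stat class, not the PR's ColumnStat):

```scala
import scala.collection.mutable

// Stand-in for ColumnStat, keeping only the fields needed for this example.
case class SimpleColStat(min: Int, max: Int)

val colStats = mutable.Map("c" -> SimpleColStat(min = 0, max = 100))

// Evaluate "c > 40": raise the lower bound.
colStats("c") = colStats("c").copy(min = 40)   // [40, 100]

// Evaluate "c <= 50": lower the upper bound.
colStats("c") = colStats("c").copy(max = 50)   // [40, 50]
```

The real implementation keys the map by ExprId and stores full ColumnStat values, but the update pattern is the same.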

[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95523914
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
 ---
@@ -0,0 +1,555 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import scala.collection.immutable.{HashSet, Map}
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+
+class FilterEstimation extends Logging {
+
+  /**
+   * We use a mutable colStats because we need to update the corresponding 
ColumnStat
+   * for a column after we apply a predicate condition.  For example, a 
column c has
+   * [min, max] value as [0, 100].  In a range condition such as (c > 40 
AND c <= 50),
+   * we need to set the column's [min, max] value to [40, 100] after we 
evaluate the
+   * first condition c > 40.  We need to set the column's [min, max] value 
to [40, 50]
+   * after we evaluate the second condition c <= 50.
+   */
+  private var mutableColStats: mutable.Map[ExprId, ColumnStat] = 
mutable.Map.empty
+
+  /**
+   * Returns an option of Statistics for a Filter logical plan node.
+   * For a given compound expression condition, this method computes 
filter selectivity
+   * (or the percentage of rows meeting the filter condition), which
+   * is used to compute row count, size in bytes, and the updated 
statistics after a given
+   * predicate is applied.
+   *
+   * @param plan a LogicalPlan node that must be an instance of Filter.
+   * @return Option[Statistics] When there is no statistics collected, it 
returns None.
+   */
+  def estimate(plan: Filter): Option[Statistics] = {
+val stats: Statistics = plan.child.statistics
+if (stats.rowCount.isEmpty) return None
+
+// save a mutable copy of colStats so that we can later change it 
recursively
+val statsExprIdMap: Map[ExprId, ColumnStat] =
+  stats.attributeStats.map(kv => (kv._1.exprId, kv._2))
+mutableColStats = mutable.Map.empty ++= statsExprIdMap
+
+// estimate selectivity of this filter predicate
+val percent: Double = calculateConditions(plan, plan.condition)
+
+// attributeStats has mapping Attribute-to-ColumnStat.
+// mutableColStats has mapping ExprId-to-ColumnStat.
+// We use an ExprId-to-Attribute map to facilitate the mapping 
Attribute-to-ColumnStat
+val expridToAttrMap: Map[ExprId, Attribute] =
+  stats.attributeStats.map(kv => (kv._1.exprId, kv._1))
+// copy mutableColStats contents to an immutable AttributeMap.
+val mutableAttributeStats: mutable.Map[Attribute, ColumnStat] =
+  mutableColStats.map(kv => expridToAttrMap(kv._1) -> kv._2)
+val newColStats = AttributeMap(mutableAttributeStats.toSeq)
+
+val filteredRowCountValue: BigInt =
+  EstimationUtils.ceil(BigDecimal(stats.rowCount.get) * percent)
+val avgRowSize = BigDecimal(EstimationUtils.getRowSize(plan.output, 
newColStats))
+val filteredSizeInBytes: BigInt =
+  EstimationUtils.ceil(BigDecimal(filteredRowCountValue) * avgRowSize)
+
+Some(stats.copy(sizeInBytes = filteredSizeInBytes, rowCount = 
Some(filteredRowCountValue),
+  attributeStats = newColStats))
+  }
+
+  /**
+   * Returns a percentage of rows meeting a compound condition in Filter 
node.
+   * A compound condition is depomposed into multiple single conditions 
linked with AND, OR, NOT.
--- End diff --

decomposed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as 
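
To make the decomposition this doc comment (and rxin's question) refers to concrete, here is a minimal sketch of how a compound predicate could be split on And/Or/Not and the per-condition selectivities combined. The independence assumption and the ??? placeholder are mine; the PR additionally updates column stats between AND-ed conditions, which this sketch omits.

```scala
import org.apache.spark.sql.catalyst.expressions.{And, Expression, Not, Or}

// Placeholder: selectivity of a single, non-compound condition (e.g. c > 40).
def singleConditionSelectivity(cond: Expression): Double = ???

def selectivity(cond: Expression): Double = cond match {
  case And(left, right) => selectivity(left) * selectivity(right)
  case Or(left, right) =>
    val (l, r) = (selectivity(left), selectivity(right))
    l + r - l * r                      // inclusion-exclusion under independence
  case Not(child) => 1.0 - selectivity(child)
  case other => singleConditionSelectivity(other)
}
```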

[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95523806
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
 ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * In this test suite, we test the proedicates containing the following 
operators:
--- End diff --

"the proedicates" -> "predicates"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95523768
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/Range.scala
 ---
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import java.math.{BigDecimal => JDecimal}
+import java.sql.{Date, Timestamp}
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types.{BooleanType, DateType, TimestampType, _}
+
+
+/** Value range of a column. */
+trait Range
+
+/** For simplicity we use decimal to unify operations of numeric ranges. */
+case class NumericRange(min: JDecimal, max: JDecimal) extends Range
+
+/**
+ * This version of Spark does not have min/max for binary/string types, so we 
define their default
+ * behavior with this class.
+ */
+class DefaultRange extends Range
+
+/** This is for columns with only null values. */
+class NullRange extends Range
+
+object Range {
+  def apply(min: Option[Any], max: Option[Any], dataType: DataType): Range 
= dataType match {
+case StringType | BinaryType => new DefaultRange()
+case _ if min.isEmpty || max.isEmpty => new NullRange()
+case _ => toNumericRange(min.get, max.get, dataType)
+  }
+
+  /**
+   * For simplicity we use decimal to unify operations of numeric types, 
the two methods below
+   * are the contract of conversion.
+   */
+  private def toNumericRange(min: Any, max: Any, dataType: DataType): 
NumericRange = {
+dataType match {
+  case _: NumericType =>
+NumericRange(new JDecimal(min.toString), new 
JDecimal(max.toString))
+  case BooleanType =>
+val min1 = if (min.asInstanceOf[Boolean]) 1 else 0
+val max1 = if (max.asInstanceOf[Boolean]) 1 else 0
+NumericRange(new JDecimal(min1), new JDecimal(max1))
+  case DateType =>
+val min1 = DateTimeUtils.fromJavaDate(min.asInstanceOf[Date])
+val max1 = DateTimeUtils.fromJavaDate(max.asInstanceOf[Date])
+NumericRange(new JDecimal(min1), new JDecimal(max1))
+  case TimestampType =>
+val min1 = 
DateTimeUtils.fromJavaTimestamp(min.asInstanceOf[Timestamp])
+val max1 = 
DateTimeUtils.fromJavaTimestamp(max.asInstanceOf[Timestamp])
+NumericRange(new JDecimal(min1), new JDecimal(max1))
+  case _ =>
+throw new AnalysisException(s"Type $dataType is not castable to 
numeric in estimation.")
--- End diff --

when we get here, is it an error in Spark? if yes, we should probably throw 
UnsupportedOperationException


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
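
For reference, the change suggested above would only affect the default case of toNumericRange; a minimal sketch of the suggested behavior (the message is copied from the AnalysisException branch in the diff, the helper name is made up):

```scala
import org.apache.spark.sql.types.DataType

// Hypothetical helper showing the suggested exception type for unsupported types.
def unsupportedForEstimation(dataType: DataType): Nothing =
  throw new UnsupportedOperationException(
    s"Type $dataType is not castable to numeric in estimation.")
```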



[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95523666
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/Range.scala
 ---
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import java.math.{BigDecimal => JDecimal}
+import java.sql.{Date, Timestamp}
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types.{BooleanType, DateType, TimestampType, _}
+
+
+/** Value range of a column. */
+trait Range
+
+/** For simplicity we use decimal to unify operations of numeric ranges. */
+case class NumericRange(min: JDecimal, max: JDecimal) extends Range
+
+/**
+ * This version of Spark does not have min/max for binary/string types, so we 
define their default
+ * behavior with this class.
+ */
+class DefaultRange extends Range
+
+/** This is for columns with only null values. */
+class NullRange extends Range
+
+object Range {
+  def apply(min: Option[Any], max: Option[Any], dataType: DataType): Range 
= dataType match {
+case StringType | BinaryType => new DefaultRange()
+case _ if min.isEmpty || max.isEmpty => new NullRange()
+case _ => toNumericRange(min.get, max.get, dataType)
+  }
+
+  /**
+   * For simplicity we use decimal to unify operations of numeric types, 
the two methods below
+   * are the contract of conversion.
+   */
+  private def toNumericRange(min: Any, max: Any, dataType: DataType): 
NumericRange = {
+dataType match {
+  case _: NumericType =>
+NumericRange(new JDecimal(min.toString), new 
JDecimal(max.toString))
+  case BooleanType =>
+val min1 = if (min.asInstanceOf[Boolean]) 1 else 0
+val max1 = if (max.asInstanceOf[Boolean]) 1 else 0
+NumericRange(new JDecimal(min1), new JDecimal(max1))
+  case DateType =>
+val min1 = DateTimeUtils.fromJavaDate(min.asInstanceOf[Date])
+val max1 = DateTimeUtils.fromJavaDate(max.asInstanceOf[Date])
+NumericRange(new JDecimal(min1), new JDecimal(max1))
+  case TimestampType =>
+val min1 = 
DateTimeUtils.fromJavaTimestamp(min.asInstanceOf[Timestamp])
--- End diff --

can we make sure we have tests for date / timestamp types?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
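
A date/timestamp test of the kind rxin is asking for could exercise Range.apply directly, mirroring the DateType branch shown in the diff. The values and assertions below are illustrative only, and assume the Range/NumericRange classes from the diff are importable:

```scala
import java.sql.Date

import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.{NumericRange, Range}
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.types.DateType

val minDate = Date.valueOf("2017-01-01")
val maxDate = Date.valueOf("2017-01-10")

// A date column's range should become a NumericRange over the underlying day counts.
Range(Some(minDate), Some(maxDate), DateType) match {
  case NumericRange(min, max) =>
    assert(min.intValue == DateTimeUtils.fromJavaDate(minDate))
    assert(max.intValue == DateTimeUtils.fromJavaDate(maxDate))
  case other => sys.error(s"expected a NumericRange, got $other")
}
```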



[GitHub] spark issue #16544: [SPARK-19149][SQL] Follow-up: simplify cache implementat...

2017-01-10 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/16544
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95522835
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
 ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * In this test suite, we test the predicates containing the following 
operators:
+ * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN
+ */
+
+class FilterEstimationSuite extends StatsEstimationTestBase {
+
+  // Suppose our test table has one column called "key1".
+  // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
+  // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, 
maxLen:4
+  val ar = AttributeReference("key1", IntegerType)()
+  val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4)
+  val child = StatsTestPlan(
+outputList = Seq(ar),
+stats = Statistics(
+  sizeInBytes = 10 * 4,
+  rowCount = Some(10),
+  attributeStats = AttributeMap(Seq(ar -> childColStat))
+)
+  )
+
+  test("filter estimation with equality comparison") {
+// the predicate is "WHERE key1 = 2"
+val intValue = Literal(2, IntegerType)
+val condition = EqualTo(ar, intValue)
+val filterNode = Filter(condition, child)
+val filteredColStats = ColumnStat(1, Some(2), Some(2), 0, 4, 4)
+
+validateEstimatedStats(filterNode, filteredColStats, Some(1L))
+  }
+
+  test("filter estimation with less than comparison") {
+// the predicate is "WHERE key1 < 3"
+val intValue = Literal(3, IntegerType)
--- End diff --

same thing with the following test cases - can you make sure we have the 
proper coverage?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95522818
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
 ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * In this test suite, we test the predicates containing the following 
operators:
+ * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN
+ */
+
+class FilterEstimationSuite extends StatsEstimationTestBase {
+
+  // Suppose our test table has one column called "key1".
+  // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
+  // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, 
maxLen:4
+  val ar = AttributeReference("key1", IntegerType)()
+  val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4)
+  val child = StatsTestPlan(
+outputList = Seq(ar),
+stats = Statistics(
+  sizeInBytes = 10 * 4,
+  rowCount = Some(10),
+  attributeStats = AttributeMap(Seq(ar -> childColStat))
+)
+  )
+
+  test("filter estimation with equality comparison") {
+// the predicate is "WHERE key1 = 2"
+val intValue = Literal(2, IntegerType)
+val condition = EqualTo(ar, intValue)
+val filterNode = Filter(condition, child)
+val filteredColStats = ColumnStat(1, Some(2), Some(2), 0, 4, 4)
+
+validateEstimatedStats(filterNode, filteredColStats, Some(1L))
+  }
+
+  test("filter estimation with less than comparison") {
+// the predicate is "WHERE key1 < 3"
+val intValue = Literal(3, IntegerType)
--- End diff --

here we should also have a check using a value > 10, shouldn't we?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
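
A check with a value beyond the column's max, as suggested above, could follow the same pattern as the existing tests. The expected stats below are an illustration (with min = 1 and max = 10, "key1 < 11" should keep all rows), not figures taken from the PR; ar, child and validateEstimatedStats are the fixtures from the suite in the diff.

```scala
test("filter estimation with out-of-range comparison") {
  // the predicate is "WHERE key1 < 11"; max(key1) is 10, so nothing is filtered out
  val condition = LessThan(ar, Literal(11, IntegerType))
  val filterNode = Filter(condition, child)
  val expected = ColumnStat(10, Some(1), Some(10), 0, 4, 4)

  validateEstimatedStats(filterNode, expected, Some(10L))
}
```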



[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95522719
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
 ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * In this test suite, we test the predicates containing the following 
operators:
+ * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN
+ */
+
+class FilterEstimationSuite extends StatsEstimationTestBase {
+
+  // Suppose our test table has one column called "key1".
+  // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
+  // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, 
maxLen:4
+  val ar = AttributeReference("key1", IntegerType)()
+  val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4)
+  val child = StatsTestPlan(
+outputList = Seq(ar),
+stats = Statistics(
+  sizeInBytes = 10 * 4,
+  rowCount = Some(10),
+  attributeStats = AttributeMap(Seq(ar -> childColStat))
+)
+  )
+
+  test("filter estimation with equality comparison") {
--- End diff --

also i'd try to make this more readable by just using this as the test case 
name:

```
test("key1 = 2") {
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95522619
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
 ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * In this test suite, we test the predicates containing the following 
operators:
+ * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN
+ */
+
+class FilterEstimationSuite extends StatsEstimationTestBase {
+
+  // Suppose our test table has one column called "key1".
+  // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
+  // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, 
maxLen:4
+  val ar = AttributeReference("key1", IntegerType)()
+  val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4)
+  val child = StatsTestPlan(
+outputList = Seq(ar),
+stats = Statistics(
+  sizeInBytes = 10 * 4,
+  rowCount = Some(10),
+  attributeStats = AttributeMap(Seq(ar -> childColStat))
+)
+  )
+
+  test("filter estimation with equality comparison") {
+// the predicate is "WHERE key1 = 2"
+val intValue = Literal(2, IntegerType)
+val condition = EqualTo(ar, intValue)
--- End diff --

ok i now read through the test case. to make it more readable, I'd write it 
like this:

```
validateEstimatedStats(
  Filter(EqualTo(ar, Literal(2)), child),
  ColumnStat(distinctCount = 1, min = Some(2), max = Some(2), nullCount = 
0, avgLen = 4, maxLen = 4)
)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15435
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71179/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15435
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95522422
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2476,4 +2476,14 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   assert(sql("SELECT * FROM array_tbl where arr = ARRAY(1L)").count == 
1)
 }
   }
+
+  test("should be able to resolve a persistent view") {
--- End diff --

I see. That means this PR enables view support without enabling Hive 
support. This test case covers only a very basic case. We need to check 
more scenarios, like ALTER VIEW. Please keep this in mind for the follow-up PRs. 

Also, update the PR description and mention this in a separate bullet. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
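
The extra scenarios mentioned here (nested views, ALTER VIEW) could be covered in a follow-up along these lines; an illustrative sketch only, assuming the usual test helpers (withTable, withView, checkAnswer) are available in the suite, with made-up table and view names:

```scala
test("resolve nested views and ALTER VIEW") {
  withTable("t") {
    withView("v1", "v2") {
      sql("CREATE TABLE t(i INT) USING parquet")
      sql("INSERT INTO t VALUES (1), (2)")
      sql("CREATE VIEW v1 AS SELECT * FROM t")
      sql("CREATE VIEW v2 AS SELECT * FROM v1")            // nested view
      checkAnswer(sql("SELECT * FROM v2"), Row(1) :: Row(2) :: Nil)

      sql("ALTER VIEW v2 AS SELECT i + 1 AS j FROM v1")    // redefine the view
      checkAnswer(sql("SELECT * FROM v2"), Row(2) :: Row(3) :: Nil)
    }
  }
}
```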



[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95522395
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
 ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * In this test suite, we test the predicates containing the following 
operators:
+ * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN
+ */
+
+class FilterEstimationSuite extends StatsEstimationTestBase {
+
+  // Suppose our test table has one column called "key1".
+  // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
+  // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, 
maxLen:4
+  val ar = AttributeReference("key1", IntegerType)()
+  val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4)
+  val child = StatsTestPlan(
+outputList = Seq(ar),
+stats = Statistics(
+  sizeInBytes = 10 * 4,
+  rowCount = Some(10),
+  attributeStats = AttributeMap(Seq(ar -> childColStat))
+)
+  )
+
+  test("filter estimation with equality comparison") {
+// the predicate is "WHERE key1 = 2"
+val intValue = Literal(2, IntegerType)
+val condition = EqualTo(ar, intValue)
+val filterNode = Filter(condition, child)
+val filteredColStats = ColumnStat(1, Some(2), Some(2), 0, 4, 4)
--- End diff --

also i'd rename `filteredColStats` to just `expected`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95522373
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
 ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * In this test suite, we test the predicates containing the following 
operators:
+ * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN
+ */
+
+class FilterEstimationSuite extends StatsEstimationTestBase {
+
+  // Suppose our test table has one column called "key1".
+  // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
+  // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, 
maxLen:4
+  val ar = AttributeReference("key1", IntegerType)()
+  val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4)
+  val child = StatsTestPlan(
+outputList = Seq(ar),
+stats = Statistics(
+  sizeInBytes = 10 * 4,
+  rowCount = Some(10),
+  attributeStats = AttributeMap(Seq(ar -> childColStat))
+)
+  )
+
+  test("filter estimation with equality comparison") {
+// the predicate is "WHERE key1 = 2"
+val intValue = Literal(2, IntegerType)
+val condition = EqualTo(ar, intValue)
+val filterNode = Filter(condition, child)
+val filteredColStats = ColumnStat(1, Some(2), Some(2), 0, 4, 4)
+
+validateEstimatedStats(filterNode, filteredColStats, Some(1L))
+  }
+
+  test("filter estimation with less than comparison") {
+// the predicate is "WHERE key1 < 3"
+val intValue = Literal(3, IntegerType)
+val condition = LessThan(ar, intValue)
+val filterNode = Filter(condition, child)
+val filteredColStats = ColumnStat(2, Some(1), Some(3), 0, 4, 4)
+
+validateEstimatedStats(filterNode, filteredColStats, Some(3L))
+  }
+
+  test("filter estimation with less than or equal to comparison") {
+// the predicate is "WHERE key1 <= 3"
+val intValue = Literal(3, IntegerType)
+val condition = LessThanOrEqual(ar, intValue)
+val filterNode = Filter(condition, child)
+val filteredColStats = ColumnStat(2, Some(1), Some(3), 0, 4, 4)
+
+validateEstimatedStats(filterNode, filteredColStats, Some(3L))
+
+  }
+
+  test("filter estimation with greater than comparison") {
+// the predicate is "WHERE key1 > 6"
+val intValue = Literal(6, IntegerType)
+val condition = GreaterThan(ar, intValue)
+val filterNode = Filter(condition, child)
+val filteredColStats = ColumnStat(4, Some(6), Some(10), 0, 4, 4)
+
+validateEstimatedStats(filterNode, filteredColStats, Some(5L))
+  }
+
+  test("filter estimation with greater than or equal to comparison") {
+// the predicate is "WHERE key1 >= 6"
+val intValue = Literal(6, IntegerType)
+val condition = GreaterThanOrEqual(ar, intValue)
+val filterNode = Filter(condition, child)
+val filteredColStats = ColumnStat(4, Some(6), Some(10), 0, 4, 4)
+
+validateEstimatedStats(filterNode, filteredColStats, Some(5L))
+
+  }
+
+  test("filter estimation with IS NULL comparison") {
+// the predicate is "WHERE key1 IS NULL"
+val condition = IsNull(ar)
+val filterNode = Filter(condition, child)
+val filteredColStats = ColumnStat(0, None, None, 0, 4, 4)
+
+validateEstimatedStats(filterNode, filteredColStats, Some(0L))
+  }
+
+  test("filter estimation with IS NOT NULL comparison") {
+// the predicate is "WHERE key1 IS NOT NULL"
+val condition = 

[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95522411
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
 ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * In this test suite, we test the predicates containing the following 
operators:
+ * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN
+ */
+
+class FilterEstimationSuite extends StatsEstimationTestBase {
+
+  // Suppose our test table has one column called "key1".
+  // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
+  // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, 
maxLen:4
+  val ar = AttributeReference("key1", IntegerType)()
+  val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4)
+  val child = StatsTestPlan(
+outputList = Seq(ar),
+stats = Statistics(
+  sizeInBytes = 10 * 4,
+  rowCount = Some(10),
+  attributeStats = AttributeMap(Seq(ar -> childColStat))
+)
+  )
+
+  test("filter estimation with equality comparison") {
+// the predicate is "WHERE key1 = 2"
+val intValue = Literal(2, IntegerType)
+val condition = EqualTo(ar, intValue)
+val filterNode = Filter(condition, child)
--- End diff --

just `filter`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15435
  
**[Test build #71179 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71179/testReport)**
 for PR 15435 at commit 
[`3101f29`](https://github.com/apache/spark/commit/3101f2905f314cbfa1df6383708a4438adbcff00).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95522307
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
 ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * In this test suite, we test the predicates containing the following 
operators:
+ * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN
+ */
+
+class FilterEstimationSuite extends StatsEstimationTestBase {
+
+  // Suppose our test table has one column called "key1".
+  // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
+  // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, 
maxLen:4
+  val ar = AttributeReference("key1", IntegerType)()
+  val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4)
+  val child = StatsTestPlan(
+outputList = Seq(ar),
+stats = Statistics(
+  sizeInBytes = 10 * 4,
+  rowCount = Some(10),
+  attributeStats = AttributeMap(Seq(ar -> childColStat))
+)
+  )
+
+  test("filter estimation with equality comparison") {
+// the predicate is "WHERE key1 = 2"
+val intValue = Literal(2, IntegerType)
+val condition = EqualTo(ar, intValue)
--- End diff --

this is just
```
val condition = EqualTo(ar, Literal(2))
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95522247
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
 ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils._
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * In this test suite, we test the predicates containing the following 
operators:
+ * =, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, NOT IN
+ */
+
+class FilterEstimationSuite extends StatsEstimationTestBase {
+
+  // Suppose our test table has one column called "key1".
+  // It has 10 rows with values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
+  // Hence, distinctCount:10, min:1, max:10, nullCount:0, avgLen:4, 
maxLen:4
+  val ar = AttributeReference("key1", IntegerType)()
+  val childColStat = ColumnStat(10, Some(1), Some(10), 0, 4, 4)
+  val child = StatsTestPlan(
+outputList = Seq(ar),
+stats = Statistics(
+  sizeInBytes = 10 * 4,
+  rowCount = Some(10),
+  attributeStats = AttributeMap(Seq(ar -> childColStat))
+)
+  )
+
+  test("filter estimation with equality comparison") {
+// the predicate is "WHERE key1 = 2"
+val intValue = Literal(2, IntegerType)
+val condition = EqualTo(ar, intValue)
+val filterNode = Filter(condition, child)
+val filteredColStats = ColumnStat(1, Some(2), Some(2), 0, 4, 4)
--- End diff --

as i commented on the other PR, i think we should use named arguments here 
so readers would know what 0, 4, 4 means.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
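
With named arguments, the constructor call from the test would read (same values, just labeled):

```scala
val expected = ColumnStat(distinctCount = 1, min = Some(2), max = Some(2),
  nullCount = 0, avgLen = 4, maxLen = 4)
```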



[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95522065
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
 ---
@@ -0,0 +1,555 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import scala.collection.immutable.{HashSet, Map}
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+
+class FilterEstimation extends Logging {
+
+  /**
+   * We use a mutable colStats because we need to update the corresponding 
ColumnStat
+   * for a column after we apply a predicate condition.  For example, a 
column c has
+   * [min, max] value as [0, 100].  In a range condition such as (c > 40 
AND c <= 50),
+   * we need to set the column's [min, max] value to [40, 100] after we 
evaluate the
+   * first condition c > 40.  We need to set the column's [min, max] value 
to [40, 50]
+   * after we evaluate the second condition c <= 50.
+   */
+  private var mutableColStats: mutable.Map[ExprId, ColumnStat] = 
mutable.Map.empty
+
+  /**
+   * Returns an option of Statistics for a Filter logical plan node.
+   * For a given compound expression condition, this method computes 
filter selectivity
+   * (or the percentage of rows meeting the filter condition), which
+   * is used to compute row count, size in bytes, and the updated 
statistics after a given
+   * predicate is applied.
+   *
+   * @param plan a LogicalPlan node that must be an instance of Filter.
--- End diff --

we can probably remove this since it doesn't really carry any information 
... (plan's type is already Filter)
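
As background, a minimal, self-contained sketch of the [min, max] narrowing described in the class doc quoted above; this is an illustration, not the PR's code, and it reproduces the [0, 100] -> [40, 100] -> [40, 50] example for `c > 40 AND c <= 50`:

```scala
// Self-contained illustration of narrowing a column's [min, max] range as
// range predicates are applied, as described in the class doc above.
final case class ValueRange(min: Double, max: Double)

def applyGreaterThan(r: ValueRange, v: Double): ValueRange = r.copy(min = math.max(r.min, v))
def applyLessOrEqual(r: ValueRange, v: Double): ValueRange = r.copy(max = math.min(r.max, v))

val initial     = ValueRange(0, 100)
val afterFirst  = applyGreaterThan(initial, 40)     // ValueRange(40.0, 100.0) after c > 40
val afterSecond = applyLessOrEqual(afterFirst, 50)  // ValueRange(40.0, 50.0)  after c <= 50
```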


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15435
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71177/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15435
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15435
  
**[Test build #71177 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71177/testReport)**
 for PR 15435 at commit 
[`3fb4bfc`](https://github.com/apache/spark/commit/3fb4bfcce42d07704c8da889f9ff82416adcbd08).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16528: [SPARK-19148][SQL] do not expose the external table conc...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16528
  
**[Test build #71191 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71191/testReport)**
 for PR 16528 at commit 
[`0d1baf1`](https://github.com/apache/spark/commit/0d1baf1f3bc92e52ed6f29b0b06db5979ff0babf).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16545: [SPARK-19166][SQL]rename from InsertIntoHadoopFsRelation...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16545
  
**[Test build #71190 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71190/testReport)**
 for PR 16545 at commit 
[`0622c04`](https://github.com/apache/spark/commit/0622c04c6129ef699ca0d8f6907d8bbc6d025387).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16545: [SPARK-19166][SQL]rename from InsertIntoHadoopFsR...

2017-01-10 Thread windpiger
GitHub user windpiger opened a pull request:

https://github.com/apache/spark/pull/16545

[SPARK-19166][SQL]rename from 
InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to 
InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix

## What changes were proposed in this pull request?

InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions deletes all files 
that match a static prefix, such as a partition file path (/table/foo=1) or a 
non-partition file path (/xxx/a.json).

However, the method name deleteMatchingPartitions suggests that only partition 
files will be deleted, which is confusing.

It is better to rename the method.

## How was this patch tested?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/windpiger/spark modifyAMethodName

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16545.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16545


commit 0622c04c6129ef699ca0d8f6907d8bbc6d025387
Author: windpiger 
Date:   2017-01-11T06:59:22Z

[SPARK-19166][SQL]change method name from 
InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to 
InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r95521090
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
 ---
@@ -52,3 +56,12 @@ object EstimationUtils {
 }.sum
   }
 }
+
+/** Attribute Reference extractor */
+object ExtractAttr {
--- End diff --

is this necessary? isn't this just

```
case op @ EqualTo(ar: AttributeReference, l: Literal) =>
```
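
For context, a sketch contrasting the two styles being discussed, assuming catalyst's expression classes are on the classpath (the selectivity values are placeholders, not the PR's logic):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Expression, Literal}

// Alternative 1: a dedicated extractor object, as in the diff.
object ExtractAttr {
  def unapply(expr: Expression): Option[AttributeReference] = expr match {
    case ar: AttributeReference => Some(ar)
    case _ => None
  }
}

def selectivityWithExtractor(cond: Expression): Option[Double] = cond match {
  case EqualTo(ExtractAttr(ar), _: Literal) => Some(0.1) // placeholder estimate
  case _ => None
}

// Alternative 2: match on AttributeReference directly, as suggested above.
def selectivityDirect(cond: Expression): Option[Double] = cond match {
  case EqualTo(ar: AttributeReference, _: Literal) => Some(0.1) // placeholder estimate
  case _ => None
}
```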


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16544: [SPARK-19149][SQL] Follow-up: simplify cache implementat...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16544
  
**[Test build #71189 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71189/testReport)**
 for PR 16544 at commit 
[`622d62b`](https://github.com/apache/spark/commit/622d62b8c1659f420f29cc36729f7cc5957a9027).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16544: [SPARK-19149][SQL] Follow-up: simplify cache impl...

2017-01-10 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/16544

[SPARK-19149][SQL] Follow-up: simplify cache implementation.

## What changes were proposed in this pull request?
This patch simplifies slightly the logical plan statistics cache 
implementation, as discussed in https://github.com/apache/spark/pull/16529

## How was this patch tested?
N/A - this has no behavior change.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-19149

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16544.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16544


commit 622d62b8c1659f420f29cc36729f7cc5957a9027
Author: Reynold Xin 
Date:   2017-01-11T06:45:49Z

[SPARK-19149][SQL] Follow-up: simplify cache implementation.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16544: [SPARK-19149][SQL] Follow-up: simplify cache implementat...

2017-01-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16544
  
cc @wzhfy


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16543: [SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm for Gamm...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16543
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71184/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16543: [SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm for Gamm...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16543
  
**[Test build #71184 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71184/consoleFull)**
 for PR 16543 at commit 
[`b7f934a`](https://github.com/apache/spark/commit/b7f934ad2eb3f39125d9bc29289e8ce3a49f48b7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16543: [SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm for Gamm...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16543
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95520270
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala 
---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast}
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, 
View}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * This file defines analysis rules related to views.
+ */
+
+/**
+ * Make sure that a view's child plan produces the view's output 
attributes. We wrap the child
+ * with a Project and add an alias for each output attribute. The 
attributes are resolved by
+ * name. This should be only done after the batch of Resolution, because 
the view attributes are
+ * not completely resolved during the batch of Resolution.
+ */
+case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan 
resolveOperators {
+case v @ View(_, output, child) if child.resolved =>
+  val resolver = conf.resolver
+  val newOutput = output.map { attr =>
+val originAttr = findAttributeByName(attr.name, child.output, 
resolver)
+// The dataType of the output attributes may be not the same with 
that of the view output,
+// so we should cast the attribute to the dataType of the view 
output attribute. If the
+// cast cann't perform, will throw an AnalysisException.
+Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = 
attr.exprId,
--- End diff --

Note that all tests can pass without this `Cast`, but it does fix a weird 
behavior: the result of a view query may have a different schema if the view 
definition has been changed. Shall we pull it out into a follow-up PR or do it 
here? cc @hvanhovell @yhuai 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16269: [SPARK-19080][SQL] simplify data source analysis

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16269
  
**[Test build #71188 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71188/testReport)**
 for PR 16269 at commit 
[`87be209`](https://github.com/apache/spark/commit/87be2096d43ae4c9083c2e262c0816bca03dda32).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95519850
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2476,4 +2476,14 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   assert(sql("SELECT * FROM array_tbl where arr = ARRAY(1L)").count == 
1)
 }
   }
+
+  test("should be able to resolve a persistent view") {
--- End diff --

We don't define the behavior of resolving a view using a SQLContext in 
the current master; this test case is meant to define that behavior.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16529: [SPARK-19149] [SQL] Unify two sets of statistics in Logi...

2017-01-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16529
  
Actually the multi-threaded issue probably doesn't matter. I will just 
change it to the original Option implementation.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16529: [SPARK-19149] [SQL] Unify two sets of statistics ...

2017-01-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16529


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...

2017-01-10 Thread imatiach-msft
Github user imatiach-msft commented on the issue:

https://github.com/apache/spark/pull/16441
  
thank you @sethah, I've updated the PR based on your latest comments.  
@jkbradley would you be able to take a look when you have time?  Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95519288
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2476,4 +2476,14 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   assert(sql("SELECT * FROM array_tbl where arr = ARRAY(1L)").count == 
1)
 }
   }
+
+  test("should be able to resolve a persistent view") {
--- End diff --

What is your goal for this test case? Any reason?

BTW, we should move this test case to `SQLQueryTestSuite`; we are now 
trying to migrate such test cases there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16529: [SPARK-19149] [SQL] Unify two sets of statistics in Logi...

2017-01-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16529
  
Merging into master. I will fix the thread-local thing in a PR.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16441
  
**[Test build #71187 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71187/testReport)**
 for PR 16441 at commit 
[`a6ef62e`](https://github.com/apache/spark/commit/a6ef62e65f76042d4ddfc0726125df12548efbaf).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16510: [SPARK-19130][SPARKR] Support setting literal value as c...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16510
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16510: [SPARK-19130][SPARKR] Support setting literal value as c...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16510
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71183/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16510: [SPARK-19130][SPARKR] Support setting literal value as c...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16510
  
**[Test build #71183 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71183/testReport)**
 for PR 16510 at commit 
[`b10ae77`](https://github.com/apache/spark/commit/b10ae7705bb91f6885b6c551b8e572cf72b34970).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16529: [SPARK-19149] [SQL] Unify two sets of statistics ...

2017-01-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16529#discussion_r95518724
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -81,44 +81,36 @@ abstract class LogicalPlan extends 
QueryPlan[LogicalPlan] with Logging {
 }
   }
 
+  /** A cache for the estimated statistics, such that it will only be 
computed once. */
+  private val statsCache = new ThreadLocal[Option[Statistics]] {
--- End diff --

OK, I was thinking you could just use an AtomicReference.
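
For illustration, a minimal sketch of the AtomicReference alternative; the `Statistics` case class below is a hypothetical stand-in, not catalyst's real class:

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for the real Statistics class.
final case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

trait HasStatsCache {
  // One shared cache for all threads, instead of one copy per thread.
  private val statsCache = new AtomicReference[Option[Statistics]](None)

  protected def computeStats(): Statistics

  def stats: Statistics = statsCache.get().getOrElse {
    val computed = computeStats()
    // Threads may race here, but they all compute the same value, so whichever
    // write wins the compare-and-set leaves the cache in a consistent state.
    statsCache.compareAndSet(None, Some(computed))
    statsCache.get().getOrElse(computed)
  }

  def invalidateStatsCache(): Unit = statsCache.set(None)
}
```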


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...

2017-01-10 Thread imatiach-msft
Github user imatiach-msft commented on the issue:

https://github.com/apache/spark/pull/16441
  
With regards to the loss type, I think the real issue is that the user 
shouldn't be able to change the loss type on the model at all, as with many 
other parameters.  It seems strange to have the model and trainer share the 
same parameters in that case.  I think you are correct that users will never 
change the loss on the model and then expect the probability function to 
change, but the fact that they can, and that doing so has no effect, bothers 
me; still, it's not a significant issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16249: [SPARK-18828][SPARKR] Refactor scripts for R

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16249
  
**[Test build #71186 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71186/testReport)**
 for PR 16249 at commit 
[`66fc83c`](https://github.com/apache/spark/commit/66fc83cb349f5cd8a34ed3d272b8e0ab7b0fe423).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16249: [SPARK-18828][SPARKR] Refactor scripts for R

2017-01-10 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16249
  
Jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95518300
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

How about adding this to the doc:
"the actual number of partitions can be increased in multiples of 
spark.r.maxAllocationLimit, or limited by the number of columns in the data.frame."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r95518069
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
 ---
@@ -66,10 +72,157 @@ class GBTClassifierSuite extends SparkFunSuite with 
MLlibTestSparkContext
 ParamsSuite.checkParams(new GBTClassifier)
 val model = new GBTClassificationModel("gbtc",
   Array(new DecisionTreeRegressionModel("dtr", new LeafNode(0.0, 0.0, 
null), 1)),
-  Array(1.0), 1)
+  Array(1.0), 1, 2)
 ParamsSuite.checkParams(model)
   }
 
+  test("GBTClassifier: default params") {
+val gbt = new GBTClassifier
+assert(gbt.getLabelCol === "label")
+assert(gbt.getFeaturesCol === "features")
+assert(gbt.getPredictionCol === "prediction")
+assert(gbt.getRawPredictionCol === "rawPrediction")
+assert(gbt.getProbabilityCol === "probability")
+val df = trainData.toDF()
+val model = gbt.fit(df)
+model.transform(df)
+  .select("label", "probability", "prediction", "rawPrediction")
+  .collect()
+intercept[NoSuchElementException] {
+  model.getThresholds
+}
+assert(model.getFeaturesCol === "features")
+assert(model.getPredictionCol === "prediction")
+assert(model.getRawPredictionCol === "rawPrediction")
+assert(model.getProbabilityCol === "probability")
+assert(model.hasParent)
+
+// copied model must have the same parent.
+MLTestingUtils.checkCopy(model)
+  }
+
+  test("setThreshold, getThreshold") {
+val gbt = new GBTClassifier
+
+// default
+withClue("GBTClassifier should not have thresholds set by default.") {
+  intercept[NoSuchElementException] {
+gbt.getThresholds
+  }
+}
+
+// Set via thresholds
+val gbt2 = new GBTClassifier
+val threshold = Array(0.3, 0.7)
+gbt2.setThresholds(threshold)
+assert(gbt2.getThresholds.zipWithIndex.forall(valueWithIndex =>
--- End diff --

done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95517753
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala 
---
@@ -543,4 +545,157 @@ class SQLViewSuite extends QueryTest with 
SQLTestUtils with TestHiveSingleton {
   }
 }
   }
+
+  test("correctly resolve a nested view") {
+withTempDatabase { db =>
+  withView(s"$db.view1", s"$db.view2") {
+val view1 = CatalogTable(
+  identifier = TableIdentifier("view1", Some(db)),
+  tableType = CatalogTableType.VIEW,
+  storage = CatalogStorageFormat.empty,
+  schema = new StructType().add("id", "int").add("id1", "int"),
+  viewOriginalText = Some("SELECT * FROM jt"),
+  viewText = Some("SELECT * FROM jt"),
+  properties = Map[String, String] 
{CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+val view2 = CatalogTable(
+  identifier = TableIdentifier("view2", Some(db)),
+  tableType = CatalogTableType.VIEW,
+  storage = CatalogStorageFormat.empty,
+  schema = new StructType().add("id", "int").add("id1", "int"),
+  viewOriginalText = Some("SELECT * FROM view1"),
+  viewText = Some("SELECT * FROM view1"),
+  properties = Map[String, String] 
{CatalogTable.VIEW_DEFAULT_DATABASE -> db})
+activateDatabase(db) {
+  hiveContext.sessionState.catalog.createTable(view1, 
ignoreIfExists = false)
+  hiveContext.sessionState.catalog.createTable(view2, 
ignoreIfExists = false)
+  checkAnswer(sql("SELECT * FROM view2 ORDER BY id"), (1 to 
9).map(i => Row(i, i)))
+}
+  }
+}
+  }
+
+  test("correctly resolve a view with CTE") {
+withView("cte_view") {
+  val cte_view = CatalogTable(
+identifier = TableIdentifier("cte_view"),
+tableType = CatalogTableType.VIEW,
+storage = CatalogStorageFormat.empty,
+schema = new StructType().add("n", "int"),
+viewOriginalText = Some("WITH w AS (SELECT 1 AS n) SELECT n FROM 
w"),
+viewText = Some("WITH w AS (SELECT 1 AS n) SELECT n FROM w"),
+properties = Map[String, String] 
{CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+  hiveContext.sessionState.catalog.createTable(cte_view, 
ignoreIfExists = false)
+  checkAnswer(sql("SELECT * FROM cte_view"), Row(1))
+}
+  }
+
+  test("correctly resolve a view in a self join") {
+withView("join_view") {
+  val join_view = CatalogTable(
+identifier = TableIdentifier("join_view"),
+tableType = CatalogTableType.VIEW,
+storage = CatalogStorageFormat.empty,
+schema = new StructType().add("id", "int").add("id1", "int"),
+viewOriginalText = Some("SELECT * FROM jt"),
+viewText = Some("SELECT * FROM jt"),
+properties = Map[String, String] 
{CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+  hiveContext.sessionState.catalog.createTable(join_view, 
ignoreIfExists = false)
+  checkAnswer(
+sql("SELECT * FROM join_view t1 JOIN join_view t2 ON t1.id = t2.id 
ORDER BY t1.id"),
+(1 to 9).map(i => Row(i, i, i, i)))
+}
+  }
+
+  private def assertInvalidReference(query: String): Unit = {
+val e = intercept[AnalysisException] {
+  sql(query)
+}.getMessage
+assert(e.contains("Table or view not found"))
+  }
+
+  test("error handling: fail if the referenced table or view is invalid") {
+withView("view1", "view2", "view3") {
+  // Fail if the referenced table is defined in a invalid database.
+  val view1 = CatalogTable(
+identifier = TableIdentifier("view1"),
+tableType = CatalogTableType.VIEW,
+storage = CatalogStorageFormat.empty,
+schema = new StructType().add("id", "int").add("id1", "int"),
+viewOriginalText = Some("SELECT * FROM invalid_db.jt"),
+viewText = Some("SELECT * FROM invalid_db.jt"),
+properties = Map[String, String] 
{CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+  hiveContext.sessionState.catalog.createTable(view1, ignoreIfExists = 
false)
+  assertInvalidReference("SELECT * FROM view1")
+
+  // Fail if the referenced table is invalid.
+  val view2 = CatalogTable(
+identifier = TableIdentifier("view2"),
+tableType = CatalogTableType.VIEW,
+storage = CatalogStorageFormat.empty,
+schema = new StructType().add("id", "int").add("id1", "int"),
+viewOriginalText = Some("SELECT * FROM invalid_table"),
+viewText = 

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r95517754
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala ---
@@ -20,6 +20,12 @@ package org.apache.spark.mllib.tree.loss
 import org.apache.spark.annotation.{DeveloperApi, Since}
 import org.apache.spark.mllib.util.MLUtils
 
+/**
+ * Trait for adding probability function for the gradient boosting 
algorithm.
--- End diff --

moved to Loss.scala.  Removed doc on class.  Added doc for 
computeProbability method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15505
  
**[Test build #71185 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71185/testReport)**
 for PR 15505 at commit 
[`4ab93ee`](https://github.com/apache/spark/commit/4ab93eec0ccdab1b1141fc3eb0996fa4dbaf92d8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r95517520
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -275,18 +316,33 @@ class GBTClassificationModel private[ml](
   @Since("2.0.0")
   lazy val featureImportances: Vector = 
TreeEnsembleModel.featureImportances(trees, numFeatures)
 
+  private def margin(features: Vector): Double = {
+val treePredictions = 
_trees.map(_.rootNode.predictImpl(features).prediction)
+blas.ddot(numTrees, treePredictions, 1, _treeWeights, 1)
+  }
+
   /** (private[ml]) Convert to a model in the old API */
   private[ml] def toOld: OldGBTModel = {
 new OldGBTModel(OldAlgo.Classification, _trees.map(_.toOld), 
_treeWeights)
   }
 
+  /**
+   * Note: this is currently an optimization that should be removed when 
we have more loss
+   * functions available than only logistic.
+   */
+  private lazy val loss = getOldLossType
--- End diff --

Removed lazy, removed comment.  I made it lazy so as not to do the lookup 
if it isn't needed, but since the lookup isn't actually expensive and the 
laziness only seemed to confuse, it's better to remove it.
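
As background for the margin and loss discussion above, a minimal sketch of how a boosted margin is typically turned into a class probability, assuming the logistic (log-loss) link; this is an illustration, not the PR's exact code:

```scala
// Weighted sum of per-tree predictions (the "margin"), then a logistic link.
// The factor of 2 in the exponent is the log-loss convention assumed here.
def margin(treePredictions: Array[Double], treeWeights: Array[Double]): Double =
  treePredictions.zip(treeWeights).map { case (p, w) => p * w }.sum

def probabilityOfPositive(m: Double): Double = 1.0 / (1.0 + math.exp(-2.0 * m))

// Example: three trees voting with different weights.
val m = margin(Array(1.0, -1.0, 1.0), Array(0.5, 0.3, 0.2))
val p = probabilityOfPositive(m) // probability of the positive class
```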


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16441#discussion_r95517316
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -275,18 +316,33 @@ class GBTClassificationModel private[ml](
   @Since("2.0.0")
   lazy val featureImportances: Vector = 
TreeEnsembleModel.featureImportances(trees, numFeatures)
 
+  private def margin(features: Vector): Double = {
--- End diff --

done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16543: [SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm for Gamm...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16543
  
**[Test build #71184 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71184/consoleFull)**
 for PR 16543 at commit 
[`b7f934a`](https://github.com/apache/spark/commit/b7f934ad2eb3f39125d9bc29289e8ce3a49f48b7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16473: [SPARK-19069] [CORE] Expose task 'status' and 'duration'...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16473
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71176/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16543: [SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm f...

2017-01-10 Thread felixcheung
GitHub user felixcheung opened a pull request:

https://github.com/apache/spark/pull/16543

[SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm for Gamma, clarify glm 
family supported

## What changes were proposed in this pull request?

Backport to 2.0 (cherry picking from 2.1 didn't work)

## How was this patch tested?

unit test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/felixcheung/spark rgammabackport20

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16543.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16543


commit b7f934ad2eb3f39125d9bc29289e8ce3a49f48b7
Author: Felix Cheung 
Date:   2017-01-11T06:02:44Z

fix Gamma family




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16473: [SPARK-19069] [CORE] Expose task 'status' and 'duration'...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16473
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16473: [SPARK-19069] [CORE] Expose task 'status' and 'duration'...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16473
  
**[Test build #71176 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71176/testReport)**
 for PR 16473 at commit 
[`0df715d`](https://github.com/apache/spark/commit/0df715da04fb8349d8f2c1040c76fc92e1e7ad83).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95515867
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala 
---
@@ -543,4 +545,157 @@ class SQLViewSuite extends QueryTest with 
SQLTestUtils with TestHiveSingleton {
   }
 }
   }
+
+  test("correctly resolve a nested view") {
+withTempDatabase { db =>
+  withView(s"$db.view1", s"$db.view2") {
+val view1 = CatalogTable(
+  identifier = TableIdentifier("view1", Some(db)),
+  tableType = CatalogTableType.VIEW,
+  storage = CatalogStorageFormat.empty,
+  schema = new StructType().add("id", "int").add("id1", "int"),
+  viewOriginalText = Some("SELECT * FROM jt"),
+  viewText = Some("SELECT * FROM jt"),
+  properties = Map[String, String] 
{CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+val view2 = CatalogTable(
+  identifier = TableIdentifier("view2", Some(db)),
+  tableType = CatalogTableType.VIEW,
+  storage = CatalogStorageFormat.empty,
+  schema = new StructType().add("id", "int").add("id1", "int"),
+  viewOriginalText = Some("SELECT * FROM view1"),
+  viewText = Some("SELECT * FROM view1"),
+  properties = Map[String, String] 
{CatalogTable.VIEW_DEFAULT_DATABASE -> db})
+activateDatabase(db) {
+  hiveContext.sessionState.catalog.createTable(view1, 
ignoreIfExists = false)
+  hiveContext.sessionState.catalog.createTable(view2, 
ignoreIfExists = false)
+  checkAnswer(sql("SELECT * FROM view2 ORDER BY id"), (1 to 
9).map(i => Row(i, i)))
+}
+  }
+}
+  }
+
+  test("correctly resolve a view with CTE") {
+withView("cte_view") {
+  val cte_view = CatalogTable(
+identifier = TableIdentifier("cte_view"),
+tableType = CatalogTableType.VIEW,
+storage = CatalogStorageFormat.empty,
+schema = new StructType().add("n", "int"),
+viewOriginalText = Some("WITH w AS (SELECT 1 AS n) SELECT n FROM 
w"),
+viewText = Some("WITH w AS (SELECT 1 AS n) SELECT n FROM w"),
+properties = Map[String, String] 
{CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+  hiveContext.sessionState.catalog.createTable(cte_view, 
ignoreIfExists = false)
+  checkAnswer(sql("SELECT * FROM cte_view"), Row(1))
+}
+  }
+
+  test("correctly resolve a view in a self join") {
+withView("join_view") {
+  val join_view = CatalogTable(
+identifier = TableIdentifier("join_view"),
+tableType = CatalogTableType.VIEW,
+storage = CatalogStorageFormat.empty,
+schema = new StructType().add("id", "int").add("id1", "int"),
+viewOriginalText = Some("SELECT * FROM jt"),
+viewText = Some("SELECT * FROM jt"),
+properties = Map[String, String] 
{CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+  hiveContext.sessionState.catalog.createTable(join_view, 
ignoreIfExists = false)
+  checkAnswer(
+sql("SELECT * FROM join_view t1 JOIN join_view t2 ON t1.id = t2.id 
ORDER BY t1.id"),
+(1 to 9).map(i => Row(i, i, i, i)))
+}
+  }
+
+  private def assertInvalidReference(query: String): Unit = {
+val e = intercept[AnalysisException] {
+  sql(query)
+}.getMessage
+assert(e.contains("Table or view not found"))
+  }
+
+  test("error handling: fail if the referenced table or view is invalid") {
+withView("view1", "view2", "view3") {
+  // Fail if the referenced table is defined in a invalid database.
+  val view1 = CatalogTable(
+identifier = TableIdentifier("view1"),
+tableType = CatalogTableType.VIEW,
+storage = CatalogStorageFormat.empty,
+schema = new StructType().add("id", "int").add("id1", "int"),
+viewOriginalText = Some("SELECT * FROM invalid_db.jt"),
+viewText = Some("SELECT * FROM invalid_db.jt"),
+properties = Map[String, String] 
{CatalogTable.VIEW_DEFAULT_DATABASE -> "default"})
+  hiveContext.sessionState.catalog.createTable(view1, ignoreIfExists = 
false)
+  assertInvalidReference("SELECT * FROM view1")
+
+  // Fail if the referenced table is invalid.
+  val view2 = CatalogTable(
+identifier = TableIdentifier("view2"),
+tableType = CatalogTableType.VIEW,
+storage = CatalogStorageFormat.empty,
+schema = new StructType().add("id", "int").add("id1", "int"),
+viewOriginalText = Some("SELECT * FROM invalid_table"),
+viewText = 

[GitHub] spark issue #4775: [SPARK-6016][SQL] Cannot read the parquet table after ove...

2017-01-10 Thread karthikgolagani
Github user karthikgolagani commented on the issue:

https://github.com/apache/spark/pull/4775
  
@liancheng 
Hi Lian, if you are using a SparkContext (sc), you can set 
"parquet.enable.summary-metadata" to "false" like below:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

This fixed my issue instantly. I did it in my Spark Streaming application.

> WARN ParquetOutputCommitter: could not write summary file for 
hdfs://localhost/user/hive/warehouse java.lang.NullPointerException
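
For completeness, a minimal, self-contained sketch of where that setting goes; the app name and output path below are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object DisableParquetSummaryMetadata {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("disable-parquet-summary-metadata") // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    // ParquetOutputCommitter reads this Hadoop property; "false" skips writing
    // the _metadata / _common_metadata summary files.
    spark.sparkContext.hadoopConfiguration
      .set("parquet.enable.summary-metadata", "false")

    spark.range(10).write.mode("overwrite").parquet("/tmp/example_parquet")
    spark.stop()
  }
}
```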


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95515728
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala 
---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast}
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * This file defines analysis rules related to views.
+ */
+
+/**
+ * Make sure that a view's child plan produces the view's output 
attributes. We wrap the child
+ * with a Project and add an alias for each output attribute. The 
attributes are resolved by
+ * name. This should be only done after the batch of Resolution, because 
the view attributes are
+ * not completely resolved during the batch of Resolution.
+ */
+case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+case v @ View(_, output, child) if child.resolved =>
+  val resolver = conf.resolver
+  val newOutput = output.map { attr =>
+val originAttr = findAttributeByName(attr.name, child.output, resolver)
+// The dataType of the output attributes may be not the same with 
that of the view output,
+// so we should cast the attribute to the dataType of the view 
output attribute. If the
+// cast cann't perform, will throw an AnalysisException.
--- End diff --

Nit: `cann't ` -> `can't `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16510: [SPARK-19130][SPARKR] Support setting literal val...

2017-01-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16510#discussion_r95515567
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1727,14 +1727,21 @@ setMethod("$", signature(x = "SparkDataFrame"),
 getColumn(x, name)
   })
 
-#' @param value a Column or \code{NULL}. If \code{NULL}, the specified 
Column is dropped.
+#' @param value a Column or an atomic vector in the length of 1 as literal 
value, or \code{NULL}.
+#'  If \code{NULL}, the specified Column is dropped.
 #' @rdname select
 #' @name $<-
 #' @aliases $<-,SparkDataFrame-method
 #' @note $<- since 1.4.0
 setMethod("$<-", signature(x = "SparkDataFrame"),
   function(x, name, value) {
-stopifnot(class(value) == "Column" || is.null(value))
+if (class(value) != "Column" && !is.null(value)) {
+  if (isAtomicLengthOne(value)) {
+value <- lit(value)
+  } else {
+stop("value must be a Column, Literal value as atomic in 
length of 1, or NULL")
--- End diff --

I was thinking of org.apache.spark.sql.catalyst.expressions.Literal as the 
type, but I guess I can use the lower-case term too.
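
For reference, a minimal sketch of the analogous coercion in the Scala API, where functions.lit wraps a plain value into a Column (the column names and values here are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    object LiteralColumnSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("lit-sketch").getOrCreate()
        import spark.implicits._

        val df = Seq(("John", 19), ("Mary", 25)).toDF("name", "age")

        // lit() turns the scalar 176.5 into a Column expression, mirroring what
        // `$<-` does in SparkR when it receives an atomic vector of length 1.
        df.withColumn("height", lit(176.5)).show()

        spark.stop()
      }
    }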


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16510: [SPARK-19130][SPARKR] Support setting literal value as c...

2017-01-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16510
  
**[Test build #71183 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71183/testReport)**
 for PR 16510 at commit 
[`b10ae77`](https://github.com/apache/spark/commit/b10ae7705bb91f6885b6c551b8e572cf72b34970).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95515250
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -510,32 +545,87 @@ class Analyzer(
* Replaces [[UnresolvedRelation]]s with concrete relations from the 
catalog.
*/
   object ResolveRelations extends Rule[LogicalPlan] {
-private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = {
+
+// If the unresolved relation is running directly on files, we just 
return the original
+// UnresolvedRelation, the plan will get resolved later. Else we look 
up the table from catalog
+// and change the default database name(in AnalysisContext) if it is a 
view.
+// We usually look up a table from the default database if the table 
identifier has an empty
+// database part, for a view the default database should be the 
currentDb when the view was
+// created. When the case comes to resolving a nested view, the view 
may have different default
+// database with that the referenced view has, so we need to use
+// `AnalysisContext.defaultDatabase` to track the current default 
database.
+// When the relation we resolve is a view, we fetch the 
view.desc(which is a CatalogTable), and
+// then set the value of `CatalogTable.viewDefaultDatabase` to
+// `AnalysisContext.defaultDatabase`, we look up the relations that 
the view references using
+// the default database.
+// For example:
+// |- view1 (defaultDatabase = db1)
+//   |- operator
+// |- table2 (defaultDatabase = db1)
+// |- view2 (defaultDatabase = db2)
+//|- view3 (defaultDatabase = db3)
+//   |- view4 (defaultDatabase = db4)
+// In this case, the view `view1` is a nested view, it directly 
references `table2`、`view2`
+// and `view4`, the view `view2` references `view3`. On resolving the 
table, we look up the
+// relations `table2`、`view2`、`view4` using the default database 
`db1`, and look up the
+// relation `view3` using the default database `db2`.
+//
+// Note this is compatible with the views defined by older versions of 
Spark(before 2.2), which
+// have empty defaultDatabase and all the relations in viewText have 
database part defined.
+def resolveRelation(plan: LogicalPlan): LogicalPlan = plan match {
+  case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) =>
+val defaultDatabase = AnalysisContext.get.defaultDatabase
+val relation = lookupTableFromCatalog(u, defaultDatabase)
+resolveRelation(relation)
+  // The view's child should be a logical plan parsed from the 
`desc.viewText`, the variable
+  // `viewText` should be defined, or else we throw an error on the 
generation of the View
+  // operator.
+  case view @ View(desc, _, child) if !child.resolved =>
+// Resolve all the UnresolvedRelations and Views in the child.
+val newChild = AnalysisContext.withAnalysisContext(desc.viewDefaultDatabase) {
+  execute(child)
+}
+view.copy(child = newChild)
+  case p @ SubqueryAlias(_, view: View, _) =>
+val newChild = resolveRelation(view)
+p.copy(child = newChild)
+  case _ => plan
+}
+
+def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+  case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved =>
+i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u)))
+  case u: UnresolvedRelation => resolveRelation(u)
+}
+
+// Look up the table with the given name from catalog. The database we 
used is decided by the
+// precedence:
+// 1. Use the database part of the table identifier, if it is defined;
+// 2. Use defaultDatabase, if it is defined(In this case, no temporary 
objects can be used,
+//and the default database is only used to look up a view);
+// 3. Use the currentDb of the SessionCatalog.
+private def lookupTableFromCatalog(
+u: UnresolvedRelation,
+defaultDatabase: Option[String] = None): LogicalPlan = {
   try {
-catalog.lookupRelation(u.tableIdentifier, u.alias)
+val tableIdentWithDb = u.tableIdentifier.copy(
+  database = u.tableIdentifier.database.orElse(defaultDatabase))
+catalog.lookupRelation(tableIdentWithDb, u.alias)
   } catch {
 case _: NoSuchTableException =>
   u.failAnalysis(s"Table or view not found: ${u.tableName}")
   }
 }
 
-def apply(plan: LogicalPlan): 
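
For illustration, a standalone sketch of the database-resolution precedence described in the comment above (the names and signatures here are illustrative, not the PR's code):

    object DefaultDatabasePrecedenceSketch {
      // Picks the database used to look up a relation, following the precedence:
      // 1) explicit database in the table identifier, 2) the view's default
      // database carried in AnalysisContext, 3) the catalog's current database.
      def resolveDatabase(
          identifierDatabase: Option[String],
          viewDefaultDatabase: Option[String],
          currentDb: String): String = {
        identifierDatabase.orElse(viewDefaultDatabase).getOrElse(currentDb)
      }

      def main(args: Array[String]): Unit = {
        // A view created in db1 references `table2` without a database part,
        // so the lookup falls back to db1 rather than the session's current database.
        println(resolveDatabase(None, Some("db1"), "default"))        // db1
        // A reference that names db2 explicitly wins over the view's default.
        println(resolveDatabase(Some("db2"), Some("db1"), "default")) // db2
      }
    }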

[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95514996
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala
 ---
@@ -860,6 +864,24 @@ abstract class CatalogTestUtils {
   bucketSpec = Some(BucketSpec(4, Seq("col1"), Nil)))
   }
 
+  def newView(
+  name: String,
+  database: Option[String] = None): CatalogTable = {
+val viewDefaultDatabase = database.getOrElse("default")
+CatalogTable(
+  identifier = TableIdentifier(name, database),
+  tableType = CatalogTableType.VIEW,
+  storage = CatalogStorageFormat.empty,
+  schema = new StructType()
+.add("col1", "int")
+.add("col2", "string")
+.add("a", "int")
+.add("b", "string"),
+  viewOriginalText = Some("SELECT * FROM tbl1"),
+  viewText = Some("SELECT * FROM tbl1"),
+  properties = Map[String, String] {CatalogTable.VIEW_DEFAULT_DATABASE -> viewDefaultDatabase})
--- End diff --

nit: `Map(CatalogTable.VIEW_DEFAULT_DATABASE -> viewDefaultDatabase)`, the 
Scala compiler will infer the type for us
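
For illustration, the two forms side by side in a standalone sketch (the key and value strings are placeholders):

    object MapLiteralFormsSketch {
      def main(args: Array[String]): Unit = {
        // Explicit type parameters with a block argument, as in the quoted diff:
        val explicit = Map[String, String] { "view.default.database" -> "default" }
        // The suggested form: let the compiler infer the key/value types.
        val inferred = Map("view.default.database" -> "default")
        println(explicit == inferred) // true
      }
    }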


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95514898
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -378,6 +379,35 @@ case class InsertIntoTable(
 }
 
 /**
+ * A container for holding the view description(CatalogTable), and the 
output of the view. The
+ * child should be a logical plan parsed from the `CatalogTable.viewText`, 
should throw an error
+ * if the `viewText` is not defined.
+ * This operator will be removed at the end of analysis stage.
+ *
+ * @param desc A view description(CatalogTable) that provides necessary 
information to resolve the
+ * view.
+ * @param output The output of a view operator, this is generated during 
planning the view, so that
+ *   we are able to decouple the output from the underlying 
structure.
+ * @param child The logical plan of a view operator, it should be a 
logical plan parsed from the
+ *  `CatalogTable.viewText`, should throw an error if the 
`viewText` is not defined.
+ */
+case class View(
+desc: CatalogTable,
+output: Seq[Attribute],
+child: LogicalPlan) extends LogicalPlan with MultiInstanceRelation {
--- End diff --

nit: extends `UnaryNode`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r95514867
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala 
---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast}
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * This file defines analysis rules related to views.
+ */
+
+/**
+ * Make sure that a view's child plan produces the view's output 
attributes. We wrap the child
+ * with a Project and add an alias for each output attribute. The 
attributes are resolved by
+ * name. This should be only done after the batch of Resolution, because 
the view attributes are
+ * not completely resolved during the batch of Resolution.
+ */
+case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+case v @ View(_, output, child) if child.resolved =>
+  val resolver = conf.resolver
+  val newOutput = output.map { attr =>
+val originAttr = findAttributeByName(attr.name, child.output, resolver)
+// The dataType of the output attributes may be not the same with 
that of the view output,
+// so we should cast the attribute to the dataType of the view 
output attribute. If the
+// cast cann't perform, will throw an AnalysisException.
+Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = attr.exprId,
+  qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata))
+  }
+  v.copy(child = Project(newOutput, child))
+  }
+
+  /**
+   * Find the attribute that has the expected attribute name from an 
attribute list, the names
+   * are compared using conf.resolver.
+   * If the expected attribute is not found, throw an AnalysisException.
+   */
+  private def findAttributeByName(
+  name: String,
+  attrs: Seq[Attribute],
+  resolver: Resolver): Attribute = {
+attrs.collectFirst {
--- End diff --

nit: use `find`?
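
For illustration, a self-contained sketch of why `find` reads more directly than `collectFirst` for this lookup (the resolver and attribute names are illustrative, not the PR's code):

    object FindVsCollectFirstSketch {
      def main(args: Array[String]): Unit = {
        val attributeNames = Seq("id", "name", "value")
        // A case-insensitive resolver, standing in for conf.resolver.
        val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

        // collectFirst needs a partial function even though nothing is transformed...
        val viaCollectFirst = attributeNames.collectFirst { case n if resolver(n, "NAME") => n }
        // ...while find expresses the same first-match lookup directly.
        val viaFind = attributeNames.find(n => resolver(n, "NAME"))

        println(viaCollectFirst) // Some(name)
        println(viaFind)         // Some(name)
      }
    }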


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95514830
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

it looks like it is intentional:
https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L115



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16531: [SPARK-19157][SQL] should be able to change spark...

2017-01-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16531


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16516: [SPARK-19155][ML] Make some string params of ML algorith...

2017-01-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16516
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71178/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


