[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16696


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104550722
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsConfSuite.scala
 ---
@@ -1,64 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.sql.catalyst.statsEstimation
-
-import org.apache.spark.sql.catalyst.CatalystConf
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference}
-import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, 
LogicalPlan, Statistics}
-import org.apache.spark.sql.types.IntegerType
-
-
-class StatsConfSuite extends StatsEstimationTestBase {
--- End diff --

yes. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104550347
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -773,14 +773,20 @@ case class LocalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryNo
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
+val childStats = child.stats(conf)
+if (limit == 0) {
   // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
   // (product of children).
-  1
+  Statistics(
+sizeInBytes = 1,
+rowCount = Some(0),
+isBroadcastable = childStats.isBroadcastable)
 } else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
+  // The output row count of LocalLimit should be the sum of row count 
from each partition, but
+  // since the partition number is not available here, we just use 
statistics of the child
+  // except column stats, because we don't know the distribution after 
a limit operation
+  child.stats(conf).copy(attributeStats = AttributeMap(Nil))
--- End diff --

Nit: `childStats.copy(attributeStats = AttributeMap(Nil))`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104550119
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -752,14 +752,14 @@ case class GlobalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryN
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
-  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
-  // (product of children).
-  1
-} else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
-}
-child.stats(conf).copy(sizeInBytes = sizeInBytes)
+val childStats = child.stats(conf)
+val rowCount: BigInt =
+  if (childStats.rowCount.isDefined) 
childStats.rowCount.get.min(limit) else limit
--- End diff --

Nit: `val rowCount: BigInt = 
childStats.rowCount.map(_.min(limit)).getOrElse(limit)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104546713
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -773,14 +773,20 @@ case class LocalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryNo
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
+val childStats = child.stats(conf)
+if (limit == 0) {
   // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
   // (product of children).
-  1
+  Statistics(
+sizeInBytes = 1,
+rowCount = Some(0),
+isBroadcastable = childStats.isBroadcastable)
 } else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
+  // The output row count of LocalLimit should be the sum of row count 
from each partition, but
+  // since the partition number is not available here, we just use 
statistics of the child
+  // except column stats, because we don't know the distribution after 
a limit operation
--- End diff --

Nit: the whole sentence does not have a period. How about rewriting it like?
> The output row count of LocalLimit should be the sum of row counts from 
each partition. However, since the number of partitions is not available here, 
we just use statistics of the child. Because the distirubtion after a limit 
operation is unknown, we do not propapage the column stats.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104526617
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
 ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference, Literal}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.types.IntegerType
+
+
+class BasicStatsEstimationSuite extends StatsEstimationTestBase {
+  val (ar, colStat) = (attr("key"), ColumnStat(distinctCount = 10, min = 
Some(1), max = Some(10),
--- End diff --

nit:
```
val attr = ...
vak colStat = ...
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-03 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104281417
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -773,14 +773,20 @@ case class LocalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryNo
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
+val childStats = child.stats(conf)
+if (limit == 0) {
   // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
   // (product of children).
-  1
+  Statistics(
+sizeInBytes = 1,
+rowCount = Some(0),
+isBroadcastable = childStats.isBroadcastable)
 } else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
+  // The output row count of LocalLimit should be the sum of row count 
from each partition, but
+  // since the partition number is not available here, we just use 
statistics of the child
+  // except column stats, because we don't know the distribution after 
a limit operation
--- End diff --

A loose bound can lead to significant under-estimation. E.g. a > 50, after 
limit the actual range is [40, 60], while max and min in stats are still [0, 
60], then the filter factor will be 1/6 instead of 1/2.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104281282
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -773,14 +773,20 @@ case class LocalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryNo
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
+val childStats = child.stats(conf)
+if (limit == 0) {
   // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
   // (product of children).
-  1
+  Statistics(
+sizeInBytes = 1,
+rowCount = Some(0),
+isBroadcastable = childStats.isBroadcastable)
 } else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
+  // The output row count of LocalLimit should be the sum of row count 
from each partition, but
+  // since the partition number is not available here, we just use 
statistics of the child
+  // except column stats, because we don't know the distribution after 
a limit operation
--- End diff --

hmm, what's the strategy here? is a loose bound better than nothing?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-03 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104280684
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsConfSuite.scala
 ---
@@ -1,64 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.sql.catalyst.statsEstimation
-
-import org.apache.spark.sql.catalyst.CatalystConf
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference}
-import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, 
LogicalPlan, Statistics}
-import org.apache.spark.sql.types.IntegerType
-
-
-class StatsConfSuite extends StatsEstimationTestBase {
--- End diff --

How to use `git mv` now? Do I need to revert to the unchanged version, and 
`git mv`, and then do all the changes all over again?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-03 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104280554
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -773,14 +773,20 @@ case class LocalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryNo
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
+val childStats = child.stats(conf)
+if (limit == 0) {
   // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
   // (product of children).
-  1
+  Statistics(
+sizeInBytes = 1,
+rowCount = Some(0),
+isBroadcastable = childStats.isBroadcastable)
 } else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
+  // The output row count of LocalLimit should be the sum of row count 
from each partition, but
+  // since the partition number is not available here, we just use 
statistics of the child
+  // except column stats, because we don't know the distribution after 
a limit operation
--- End diff --

How can we make sure max/min values are still there after limit? Otherwise 
it will be a very loose bound of max/min.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104232961
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -773,14 +773,20 @@ case class LocalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryNo
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
+val childStats = child.stats(conf)
+if (limit == 0) {
   // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
   // (product of children).
-  1
+  Statistics(
+sizeInBytes = 1,
+rowCount = Some(0),
+isBroadcastable = childStats.isBroadcastable)
 } else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
+  // The output row count of LocalLimit should be the sum of row count 
from each partition, but
+  // since the partition number is not available here, we just use 
statistics of the child
+  // except column stats, because we don't know the distribution after 
a limit operation
--- End diff --

but I think the max/min should still be corrected?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104223997
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsConfSuite.scala
 ---
@@ -1,64 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.sql.catalyst.statsEstimation
-
-import org.apache.spark.sql.catalyst.CatalystConf
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference}
-import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, 
LogicalPlan, Statistics}
-import org.apache.spark.sql.types.IntegerType
-
-
-class StatsConfSuite extends StatsEstimationTestBase {
--- End diff --

Can you use `git mv`? Then, it will keep the change history. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-03 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104116975
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsConfSuite.scala
 ---
@@ -1,64 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.sql.catalyst.statsEstimation
-
-import org.apache.spark.sql.catalyst.CatalystConf
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference}
-import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, 
LogicalPlan, Statistics}
-import org.apache.spark.sql.types.IntegerType
-
-
-class StatsConfSuite extends StatsEstimationTestBase {
--- End diff --

I didn't remove it, just renamed it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-03 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104116520
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala
 ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference, Literal}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.types.IntegerType
+
+
+class StatsEstimationSuite extends StatsEstimationTestBase {
--- End diff --

Good name:)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-03 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104116400
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -116,22 +116,22 @@ class StatisticsCollectionSuite extends 
StatisticsCollectionTestBase with Shared
 withTempView("test") {
--- End diff --

yea I think so


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104101253
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -116,22 +116,22 @@ class StatisticsCollectionSuite extends 
StatisticsCollectionTestBase with Shared
 withTempView("test") {
--- End diff --

is this test duplicated with the newly added limit test?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104101031
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala
 ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.statsEstimation
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference, Literal}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.types.IntegerType
+
+
+class StatsEstimationSuite extends StatsEstimationTestBase {
--- End diff --

`BasicStatsEstimationSuite`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-03-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r104100931
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsConfSuite.scala
 ---
@@ -1,64 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.sql.catalyst.statsEstimation
-
-import org.apache.spark.sql.catalyst.CatalystConf
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference}
-import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, 
LogicalPlan, Statistics}
-import org.apache.spark.sql.types.IntegerType
-
-
-class StatsConfSuite extends StatsEstimationTestBase {
--- End diff --

why remove this test suite?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-02-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r99837596
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -717,7 +717,8 @@ object Limit {
   }
 }
 
-case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends 
UnaryNode {
+abstract class LimitNode extends UnaryNode {
+  def limitExpr: Expression
--- End diff --

do we assume `limitExpr` is foldable? but seems there is no type checking 
logic for it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-02-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r99838684
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryN
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
-  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
-  // (product of children).
-  1
-} else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
-}
-child.stats(conf).copy(sizeInBytes = sizeInBytes)
+val childStats = child.stats(conf)
+// Don't propagate column stats, because we don't know the 
distribution after a limit operation
+Statistics(
+  sizeInBytes = EstimationUtils.getOutputSize(output, limit, 
childStats.attributeStats),
+  rowCount = Some(limit),
--- End diff --

This is a good point, maybe we should still separate the stats calculation 
of `GlobalLimit` and `LocalLimit`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-02-04 Thread ron8hu
Github user ron8hu commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r99474494
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala
 ---
@@ -18,12 +18,41 @@
 package org.apache.spark.sql.catalyst.statsEstimation
 
 import org.apache.spark.sql.catalyst.CatalystConf
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference}
-import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, 
LogicalPlan, Statistics}
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference, Literal}
+import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.types.IntegerType
 
 
-class StatsConfSuite extends StatsEstimationTestBase {
+class StatsEstimationSuite extends StatsEstimationTestBase {
+  val (ar, colStat) = (attr("key"), ColumnStat(distinctCount = 10, min = 
Some(1), max = Some(10),
+nullCount = 0, avgLen = 4, maxLen = 4))
+
+  val plan = StatsTestPlan(
+outputList = Seq(ar),
+attributeStats = AttributeMap(Seq(ar -> colStat)),
+rowCount = 10,
+size = Some(10 * (8 + 4)))
+
+  test("limit estimation") {
+val localLimit = LocalLimit(Literal(2), plan)
+val globalLimit = GlobalLimit(Literal(2), plan)
+// LocalLimit and GlobalLimit share the same stats estimation logic.
+val expected = Statistics(sizeInBytes = 24, rowCount = Some(2))
+checkStats(localLimit, expected)
+checkStats(globalLimit, expected)
+  }
+
+  test("sample estimation") {
+val sample = Sample(0.0, 0.5, withReplacement = false, (math.random * 
1000).toLong, plan)()
+checkStats(sample, expected = Statistics(sizeInBytes = 60, rowCount = 
Some(5)))
+
+// Test if Sample's child doesn't have rowCount in stats
+val stats2 = Statistics(sizeInBytes = 120)
--- End diff --

For limit estimation test cases, we may add a test with limit number 
greater than a child node's row count.  This test can show if we properly 
select the smaller value between limit number child node's row count. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-31 Thread ron8hu
Github user ron8hu commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r98776491
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryN
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
-  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
-  // (product of children).
-  1
-} else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
-}
-child.stats(conf).copy(sizeInBytes = sizeInBytes)
+val childStats = child.stats(conf)
+// Don't propagate column stats, because we don't know the 
distribution after a limit operation
+Statistics(
+  sizeInBytes = EstimationUtils.getOutputSize(output, limit, 
childStats.attributeStats),
--- End diff --

Agreed.  We can pick the smaller value between the child node's row count 
and the limit number. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r98146358
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryN
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
-  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
-  // (product of children).
-  1
-} else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
-}
-child.stats(conf).copy(sizeInBytes = sizeInBytes)
+val childStats = child.stats(conf)
+// Don't propagate column stats, because we don't know the 
distribution after a limit operation
+Statistics(
+  sizeInBytes = EstimationUtils.getOutputSize(output, limit, 
childStats.attributeStats),
--- End diff --

We should. Otherwise the `rowCount` is not correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r98002931
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryN
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
-  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
-  // (product of children).
-  1
-} else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
-}
-child.stats(conf).copy(sizeInBytes = sizeInBytes)
+val childStats = child.stats(conf)
+// Don't propagate column stats, because we don't know the 
distribution after a limit operation
+Statistics(
+  sizeInBytes = EstimationUtils.getOutputSize(output, limit, 
childStats.attributeStats),
+  rowCount = Some(limit),
--- End diff --

Actually the `rowCount` for `LocalLimit` and `GlobalLimit` should be 
different. For `LocalLimit`, `limit` is just the row count for one partition. 
But we can't get the number of partitions here, I think. As the actual row 
number might be quite bigger than the `limit`, maybe we should set it as None.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r97933241
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala
 ---
@@ -18,12 +18,41 @@
 package org.apache.spark.sql.catalyst.statsEstimation
 
 import org.apache.spark.sql.catalyst.CatalystConf
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference}
-import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, 
LogicalPlan, Statistics}
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference, Literal}
+import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.types.IntegerType
 
 
-class StatsConfSuite extends StatsEstimationTestBase {
+class StatsEstimationSuite extends StatsEstimationTestBase {
+  val (ar, colStat) = (attr("key"), ColumnStat(distinctCount = 10, min = 
Some(1), max = Some(10),
+nullCount = 0, avgLen = 4, maxLen = 4))
+
+  val plan = StatsTestPlan(
+outputList = Seq(ar),
+attributeStats = AttributeMap(Seq(ar -> colStat)),
+rowCount = 10,
+size = Some(10 * (8 + 4)))
+
+  test("limit estimation") {
+val localLimit = LocalLimit(Literal(2), plan)
+val globalLimit = GlobalLimit(Literal(2), plan)
+// LocalLimit and GlobalLimit share the same stats estimation logic.
+val expected = Statistics(sizeInBytes = 24, rowCount = Some(2))
+checkStats(localLimit, expected)
+checkStats(globalLimit, expected)
+  }
+
+  test("sample estimation") {
+val sample = Sample(0.0, 0.5, withReplacement = false, (math.random * 
1000).toLong, plan)()
+checkStats(sample, expected = Statistics(sizeInBytes = 60, rowCount = 
Some(5)))
+
+// Test if Sample's child doesn't have rowCount in stats
+val stats2 = Statistics(sizeInBytes = 120)
--- End diff --

The same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r97933222
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala
 ---
@@ -18,12 +18,41 @@
 package org.apache.spark.sql.catalyst.statsEstimation
 
 import org.apache.spark.sql.catalyst.CatalystConf
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference}
-import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, 
LogicalPlan, Statistics}
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference, Literal}
+import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.types.IntegerType
 
 
-class StatsConfSuite extends StatsEstimationTestBase {
+class StatsEstimationSuite extends StatsEstimationTestBase {
+  val (ar, colStat) = (attr("key"), ColumnStat(distinctCount = 10, min = 
Some(1), max = Some(10),
+nullCount = 0, avgLen = 4, maxLen = 4))
+
+  val plan = StatsTestPlan(
+outputList = Seq(ar),
+attributeStats = AttributeMap(Seq(ar -> colStat)),
+rowCount = 10,
+size = Some(10 * (8 + 4)))
+
+  test("limit estimation") {
+val localLimit = LocalLimit(Literal(2), plan)
+val globalLimit = GlobalLimit(Literal(2), plan)
+// LocalLimit and GlobalLimit share the same stats estimation logic.
+val expected = Statistics(sizeInBytes = 24, rowCount = Some(2))
+checkStats(localLimit, expected)
+checkStats(globalLimit, expected)
+  }
+
+  test("sample estimation") {
+val sample = Sample(0.0, 0.5, withReplacement = false, (math.random * 
1000).toLong, plan)()
+checkStats(sample, expected = Statistics(sizeInBytes = 60, rowCount = 
Some(5)))
+
+// Test if Sample's child doesn't have rowCount in stats
+val stats2 = Statistics(sizeInBytes = 120)
+val plan2 = DummyLogicalPlan(stats2, stats2)
--- End diff --

rename `plan2` to `childPlan`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r97933132
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala
 ---
@@ -48,6 +77,14 @@ class StatsConfSuite extends StatsEstimationTestBase {
 // Return the simple statistics
 assert(plan.stats(conf.copy(cboEnabled = false)) == 
expectedDefaultStats)
   }
+
+  /** Check estimated stats which is the same when cbo is turned on/off. */
+  private def checkStats(plan: LogicalPlan, expected: Statistics): Unit = {
--- End diff --

You know, this is a utility function. We can make it more general by having 
two expected stats values


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r97933053
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala
 ---
@@ -48,6 +77,14 @@ class StatsConfSuite extends StatsEstimationTestBase {
 // Return the simple statistics
 assert(plan.stats(conf.copy(cboEnabled = false)) == 
expectedDefaultStats)
--- End diff --

Could you replace the above three lines by `checkStats`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r97932867
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala
 ---
@@ -18,12 +18,41 @@
 package org.apache.spark.sql.catalyst.statsEstimation
 
 import org.apache.spark.sql.catalyst.CatalystConf
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference}
-import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, 
LogicalPlan, Statistics}
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference, Literal}
+import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.types.IntegerType
 
 
-class StatsConfSuite extends StatsEstimationTestBase {
+class StatsEstimationSuite extends StatsEstimationTestBase {
+  val (ar, colStat) = (attr("key"), ColumnStat(distinctCount = 10, min = 
Some(1), max = Some(10),
+nullCount = 0, avgLen = 4, maxLen = 4))
+
+  val plan = StatsTestPlan(
+outputList = Seq(ar),
+attributeStats = AttributeMap(Seq(ar -> colStat)),
+rowCount = 10,
+size = Some(10 * (8 + 4)))
--- End diff --

I still prefer to adding a comment above this line:
```
  // rowCount * (overhead + column size)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r97932053
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryN
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
-  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
-  // (product of children).
-  1
-} else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
-}
-child.stats(conf).copy(sizeInBytes = sizeInBytes)
+val childStats = child.stats(conf)
+// Don't propagate column stats, because we don't know the 
distribution after a limit operation
+Statistics(
+  sizeInBytes = EstimationUtils.getOutputSize(output, limit, 
childStats.attributeStats),
--- End diff --

I think @wzhfy is just keeping the existing code logics. Sure, we can 
improve it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r97932007
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryN
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
--- End diff --

To make the stats more accurate, yes, we can use a smaller number between 
`childStats.rowCounts` and `limit` as `outputRowCount` of `getOutputSize`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r97931646
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -791,12 +773,14 @@ case class Sample(
 
   override def computeStats(conf: CatalystConf): Statistics = {
 val ratio = upperBound - lowerBound
-// BigInt can't multiply with Double
-var sizeInBytes = child.stats(conf).sizeInBytes * (ratio * 100).toInt 
/ 100
+val childStats = child.stats(conf)
+var sizeInBytes = 
EstimationUtils.ceil(BigDecimal(childStats.sizeInBytes) * ratio)
 if (sizeInBytes == 0) {
   sizeInBytes = 1
 }
-child.stats(conf).copy(sizeInBytes = sizeInBytes)
+val sampledNumber = childStats.rowCount.map(c => 
EstimationUtils.ceil(BigDecimal(c) * ratio))
--- End diff --

`sampledNumber` -> `sampledRowCount`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r97931570
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
 ---
@@ -29,6 +31,8 @@ object EstimationUtils {
   def rowCountsExist(conf: CatalystConf, plans: LogicalPlan*): Boolean =
 plans.forall(_.stats(conf).rowCount.isDefined)
 
+  def ceil(bigDecimal: BigDecimal): BigInt = bigDecimal.setScale(0, 
RoundingMode.CEILING).toBigInt()
--- End diff --

`ceil` -> `ceiling`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-25 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16696#discussion_r97928605
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: 
LogicalPlan) extends UnaryN
   }
   override def computeStats(conf: CatalystConf): Statistics = {
 val limit = limitExpr.eval().asInstanceOf[Int]
-val sizeInBytes = if (limit == 0) {
-  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also 
be zero
-  // (product of children).
-  1
-} else {
-  (limit: Long) * output.map(a => a.dataType.defaultSize).sum
-}
-child.stats(conf).copy(sizeInBytes = sizeInBytes)
+val childStats = child.stats(conf)
+// Don't propagate column stats, because we don't know the 
distribution after a limit operation
+Statistics(
+  sizeInBytes = EstimationUtils.getOutputSize(output, limit, 
childStats.attributeStats),
--- End diff --

Why don't we use `childStats.rowCount`? If `childStats.rowCount` is less 
than limit number, I think we should use it instead of limit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

2017-01-24 Thread wzhfy
GitHub user wzhfy opened a pull request:

https://github.com/apache/spark/pull/16696

[SPARK-19350] [SQL] Cardinality estimation of Limit and Sample

## What changes were proposed in this pull request?

Before this pr, LocalLimit/GlobalLimit/Sample propagates the same row count 
and column stats from its child, which is incorrect.
We can get the correct rowCount in Statistics for Limit/Sample whether cbo 
is enabled or not. Column stats should not be propagated because we don't know 
the distribution of columns after Limit or Sample.

## How was this patch tested?

Added test cases.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wzhfy/spark limitEstimation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16696.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16696


commit 62013f5c5bc1a6d43e1b111aa3784b4524c8fda4
Author: wangzhenhua 
Date:   2017-01-25T01:00:08Z

limit and sample estimation




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org