[GitHub] spark issue #17558: [SPARK-20247][CORE] Add jar but this jar is missing late...

2017-04-06 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/17558
  
@wangyum what if the task requires that jar? From your fix, what I understand is 
that you catch the exception and log a warning instead. But if a task does 
require the jar, will your fix suppress the exception entirely, or just defer it 
to a later failure such as `ClassNotFound` at task runtime?
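
For reference, a minimal sketch of the catch-and-warn pattern under discussion, assuming a hypothetical `fetchJar` helper and plain SLF4J logging; this is not the actual Spark `addJar` code. If a task does need the jar, swallowing the failure here only defers the error to a `ClassNotFoundException` at task runtime:

```scala
import java.io.FileNotFoundException
import org.slf4j.LoggerFactory

object AddJarSketch {
  private val log = LoggerFactory.getLogger(getClass)

  // Hypothetical stand-in for the code that resolves or downloads an added jar.
  private def fetchJar(path: String): Unit = {
    if (!new java.io.File(path).exists()) {
      throw new FileNotFoundException(s"Jar not found: $path")
    }
  }

  // Catch-and-warn: addJar "succeeds" even if the jar is missing, so jobs that
  // never touch it keep running; tasks that do need it will later fail with
  // ClassNotFoundException when the class loader cannot find its classes.
  def addJarLeniently(path: String): Unit = {
    try {
      fetchJar(path)
    } catch {
      case e: Exception => log.warn(s"Failed to add jar $path", e)
    }
  }
}
```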





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110317549
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
 ---
@@ -328,7 +329,7 @@ object PartitioningUtils {
 } else {
   // TODO: Selective case sensitivity.
   val distinctPartColNames =
-        pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct
+        pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase(Locale.ROOT))).distinct
--- End diff --

I think this might cause a similar problem with 
https://github.com/apache/spark/pull/17527/files#r110317272.
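
In the same spirit as the example in the linked comment, a hypothetical REPL illustration of how the `distinct` result over lowercased partition column names can change under a Turkish default locale once `Locale.ROOT` is used:

```scala
scala> java.util.Locale.setDefault(new java.util.Locale("tr"))

scala> // Default (Turkish) locale: "I" lowercases to "ı", so the two name lists collapse.
scala> Seq(Seq("ı"), Seq("I")).map(_.map(_.toLowerCase)).distinct
res0: Seq[Seq[String]] = List(List(ı))

scala> // Locale.ROOT: "I" lowercases to "i", so the two lists stay distinct.
scala> Seq(Seq("ı"), Seq("I")).map(_.map(_.toLowerCase(java.util.Locale.ROOT))).distinct
res1: Seq[Seq[String]] = List(List(ı), List(i))
```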





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110298557
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala
 ---
@@ -396,7 +397,7 @@ object PartitioningAwareFileIndex extends Logging {
   sessionOpt: Option[SparkSession]): Seq[FileStatus] = {
 logTrace(s"Listing $path")
 val fs = path.getFileSystem(hadoopConf)
-val name = path.getName.toLowerCase
+val name = path.getName.toLowerCase(Locale.ROOT)
--- End diff --

(This variable appears to be unused.)





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110317695
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala 
---
@@ -222,7 +225,7 @@ case class PreprocessTableCreation(sparkSession: SparkSession) extends Rule[Logi
     val columnNames = if (sparkSession.sessionState.conf.caseSensitiveAnalysis) {
       schema.map(_.name)
     } else {
-      schema.map(_.name.toLowerCase)
+      schema.map(_.name.toLowerCase(Locale.ROOT))
--- End diff --

Maybe it is not worth pointing out every similar instance, but let me flag this 
one since the change is large. This may be a case similar to 
https://github.com/apache/spark/pull/17527/files#r110317272.





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110317441
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
 ---
@@ -128,7 +128,8 @@ object PartitioningUtils {
   //   "hdfs://host:9000/invalidPath"
   //   "hdfs://host:9000/path"
   // TODO: Selective case sensitivity.
-      val discoveredBasePaths = optDiscoveredBasePaths.flatten.map(_.toString.toLowerCase())
+      val discoveredBasePaths =
+        optDiscoveredBasePaths.flatten.map(_.toString.toLowerCase(Locale.ROOT))
--- End diff --

I am worried about this one too. It sounds like the path could contain Turkish 
characters, I guess.
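
A hypothetical illustration of that concern (the paths are made up): with a dotted capital İ in the path, the two base paths compare equal under a Turkish default locale but not under `Locale.ROOT`:

```scala
scala> java.util.Locale.setDefault(new java.util.Locale("tr"))

scala> val paths = Seq("hdfs://host:9000/DATA/İstanbul", "hdfs://host:9000/data/istanbul")
paths: Seq[String] = List(hdfs://host:9000/DATA/İstanbul, hdfs://host:9000/data/istanbul)

scala> // Default (Turkish) locale: "İ" lowercases to "i", so the paths collapse to one.
scala> paths.map(_.toLowerCase).distinct.size
res0: Int = 1

scala> // Locale.ROOT: "İ" lowercases to "i" plus a combining dot, so they stay different.
scala> paths.map(_.toLowerCase(java.util.Locale.ROOT)).distinct.size
res1: Int = 2
```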





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110314669
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringKeyHashMap.scala
 ---
@@ -25,7 +27,7 @@ object StringKeyHashMap {
   def apply[T](caseSensitive: Boolean): StringKeyHashMap[T] = if (caseSensitive) {
     new StringKeyHashMap[T](identity)
   } else {
-    new StringKeyHashMap[T](_.toLowerCase)
+    new StringKeyHashMap[T](_.toLowerCase(Locale.ROOT))
--- End diff --

This only seems to be used in `SimpleFunctionRegistry`. I don't think we have 
Turkish characters in function names, and I don't think users will use another 
language in function names either. So it is probably fine.
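
A small REPL illustration of why `Locale.ROOT` is still the safer normalization even for plain ASCII function names (the name here is arbitrary): under a Turkish default locale an "I" would otherwise lowercase to a dotless "ı", so a lookup normalized elsewhere could miss.

```scala
scala> java.util.Locale.setDefault(new java.util.Locale("tr"))

scala> "ISNULL".toLowerCase          // default (Turkish) locale: dotless ı
res0: String = ısnull

scala> "ISNULL".toLowerCase(java.util.Locale.ROOT)  // stable, locale-independent
res1: String = isnull
```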





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110315394
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala
 ---
@@ -82,8 +84,8 @@ case class OptimizeMetadataOnlyQuery(
   private def getPartitionAttrs(
   partitionColumnNames: Seq[String],
   relation: LogicalPlan): Seq[Attribute] = {
-    val partColumns = partitionColumnNames.map(_.toLowerCase).toSet
-    relation.output.filter(a => partColumns.contains(a.name.toLowerCase))
+    val partColumns = partitionColumnNames.map(_.toLowerCase(Locale.ROOT)).toSet
+    relation.output.filter(a => partColumns.contains(a.name.toLowerCase(Locale.ROOT)))
--- End diff --

I am a little bit worried about this change as well. For example,

Before

```scala
scala> java.util.Locale.setDefault(new java.util.Locale("tr"))

scala>  val partColumns = Seq("I").map(_.toLowerCase).toSet
partColumns: scala.collection.immutable.Set[String] = Set(ı)

scala> Seq("a", "ı", "I").filter(a => partColumns.contains(a.toLowerCase))
res13: Seq[String] = List(ı, I)
```

After

```scala
scala> val partColumns = 
Seq("I").map(_.toLowerCase(java.util.Locale.ROOT)).toSet
partColumns: scala.collection.immutable.Set[String] = Set(i)

scala> Seq("a", "ı", "I").filter(a => 
partColumns.contains(a.toLowerCase(java.util.Locale.ROOT)))
res14: Seq[String] = List(I)

```





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110314541
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala
 ---
@@ -26,11 +28,12 @@ package org.apache.spark.sql.catalyst.util
 class CaseInsensitiveMap[T] private (val originalMap: Map[String, T]) extends Map[String, T]
   with Serializable {
 
-  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase))
+  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase(Locale.ROOT)))
--- End diff --

Maybe this is nitpicking, and it is rarely possible I guess. However, to my 
knowledge this will affect the options users set via `spark.read.option(...)`. 
Namely, I think cases like the ones below are possible:

```scala
scala> java.util.Locale.setDefault(new java.util.Locale("tr"))

scala> val originalMap = Map("ı" -> 1, "I" -> 2)
originalMap: scala.collection.immutable.Map[String,Int] = Map(ı -> 1, I -> 
2)
```

Before

```scala
scala> originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase))
res6: scala.collection.immutable.Map[String,Int] = Map(ı -> 2)
```

After

```scala
scala> originalMap.map(kv => kv.copy(_1 = 
kv._1.toLowerCase(java.util.Locale.ROOT)))
res7: scala.collection.immutable.Map[String,Int] = Map(ı -> 1, i -> 2)
```





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110317272
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
 ---
@@ -52,7 +54,11 @@ case class HadoopFsRelation(
 
   val schema: StructType = {
 val getColName: (StructField => String) =
-      if (sparkSession.sessionState.conf.caseSensitiveAnalysis) _.name else _.name.toLowerCase
+      if (sparkSession.sessionState.conf.caseSensitiveAnalysis) {
+        _.name
+      } else {
+        _.name.toLowerCase(Locale.ROOT)
+      }
--- End diff --

I think we should leave this one out. It seems `dataSchema` is the schema from 
the source, which is exposed to users, so I think this change could cause a 
problem. For example:

Before

```scala
import collection.mutable

import org.apache.spark.sql.types._

java.util.Locale.setDefault(new java.util.Locale("tr"))

val partitionSchema: StructType = StructType(StructField("I", StringType) 
:: Nil)
val dataSchema: StructType = StructType(StructField("ı", StringType) :: 
Nil)

val getColName: (StructField => String) = _.name.toLowerCase

val overlappedPartCols = mutable.Map.empty[String, StructField]
partitionSchema.foreach { partitionField =>
  if (dataSchema.exists(getColName(_) == getColName(partitionField))) {
overlappedPartCols += getColName(partitionField) -> partitionField
  }
}

val schema = StructType(dataSchema.map(f => 
overlappedPartCols.getOrElse(getColName(f), f)) ++
  partitionSchema.filterNot(f => 
overlappedPartCols.contains(getColName(f

schema.fieldNames
```

prints

```scala
Array[String] = Array(I)
```

After

```scala
import collection.mutable

import org.apache.spark.sql.types._

java.util.Locale.setDefault(new java.util.Locale("tr"))

val partitionSchema: StructType = StructType(StructField("I", StringType) 
:: Nil)
val dataSchema: StructType = StructType(StructField("ı", StringType) :: 
Nil)

val getColName: (StructField => String) = 
_.name.toLowerCase(java.util.Locale.ROOT)

val overlappedPartCols = mutable.Map.empty[String, StructField]
partitionSchema.foreach { partitionField =>
  if (dataSchema.exists(getColName(_) == getColName(partitionField))) {
overlappedPartCols += getColName(partitionField) -> partitionField
  }
}

val schema = StructType(dataSchema.map(f => 
overlappedPartCols.getOrElse(getColName(f), f)) ++
  partitionSchema.filterNot(f => 
overlappedPartCols.contains(getColName(f

schema.fieldNames
```

prints

```scala
Array[String] = Array(ı, I)
```





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread ioana-delaney
Github user ioana-delaney commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110318802
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -134,7 +132,7 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr
  * For cost evaluation, since physical costs for operators are not available currently, we use
  * cardinalities and sizes to compute costs.
  */
-object JoinReorderDP extends PredicateHelper with Logging {
+case class JoinReorderDP(conf: SQLConf) extends PredicateHelper with Logging {
--- End diff --

@gatorsmile I would like to control the filters on top of the join 
enumeration. We might have other filters, e.g. left-deep trees only.





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread ioana-delaney
Github user ioana-delaney commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110318621
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -736,6 +736,12 @@ object SQLConf {
       .checkValue(weight => weight >= 0 && weight <= 1, "The weight value must be in [0, 1].")
       .createWithDefault(0.7)
 
+  val JOIN_REORDER_DP_STAR_FILTER =
+    buildConf("spark.sql.cbo.joinReorder.dp.star.filter")
+      .doc("Applies star-join filter heuristics to cost based join enumeration.")
+      .booleanConf
+      .createWithDefault(false)
--- End diff --

@gatorsmile Regardless of the default value, I still want to control the 
filters with their own knobs. The filters are applied on top of the join 
enumeration. They need to have their own control.





[GitHub] spark issue #17516: [SPARK-20197][SPARKR] CRAN check fail with package insta...

2017-04-06 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/17516
  
Don't we also need the skip-if-CRAN statement?





[GitHub] spark issue #17516: [SPARK-20197][SPARKR] CRAN check fail with package insta...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17516
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17516: [SPARK-20197][SPARKR] CRAN check fail with package insta...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17516
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75589/
Test PASSed.





[GitHub] spark issue #17516: [SPARK-20197][SPARKR] CRAN check fail with package insta...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17516
  
**[Test build #75589 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75589/testReport)** for PR 17516 at commit [`a3e8b35`](https://github.com/apache/spark/commit/a3e8b350c6ff6aff3b1537de64bfeda602d8aa11).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17557: [SPARK-20208][WIP][R][DOCS] Document R fpGrowth support

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17557
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17557: [SPARK-20208][WIP][R][DOCS] Document R fpGrowth support

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17557
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75588/
Test PASSed.





[GitHub] spark issue #17557: [SPARK-20208][WIP][R][DOCS] Document R fpGrowth support

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17557
  
**[Test build #75588 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75588/testReport)** for PR 17557 at commit [`27e94fd`](https://github.com/apache/spark/commit/27e94fd6732edc50762cf6bc7e17e900ea1ff313).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread ioana-delaney
Github user ioana-delaney commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110318101
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with Logging {
 case class Cost(card: BigInt, size: BigInt) {
   def +(other: Cost): Cost = Cost(this.card + other.card, this.size + other.size)
 }
+
+/**
+ * Implements optional filters to reduce the search space for join enumeration.
+ *
+ * 1) Star-join filters: Plan star-joins together since they are assumed
+ *    to have an optimal execution based on their RI relationship.
+ * 2) Cartesian products: Defer their planning later in the graph to avoid
+ *    large intermediate results (expanding joins, in general).
+ * 3) Composite inners: Don't generate "bushy tree" plans to avoid materializing
+ *    intermediate results.
+ *
+ * Filters (2) and (3) are not implemented.
+ */
+case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper {
+  /**
+   * Builds join graph information to be used by the filtering strategies.
+   * Currently, it builds the sets of star/non-star joins.
+   * It can be extended with the sets of connected/unconnected joins, which
+   * can be used to filter Cartesian products.
+   */
+  def buildJoinGraphInfo(
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = {
+
+// Compute the tables in a star-schema relationship.
+    val starJoin = StarSchemaDetection(conf).findStarJoins(items, conditions.toSeq)
+val nonStarJoin = items.filterNot(starJoin.contains(_))
+
+if (starJoin.nonEmpty && nonStarJoin.nonEmpty) {
+  val (starInt, nonStarInt) = planIndex.collect {
+case (p, i) if starJoin.contains(p) =>
+  (Some(i), None)
+case (p, i) if nonStarJoin.contains(p) =>
+  (None, Some(i))
+case _ =>
+  (None, None)
+  }.unzip
+  Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet))
+} else {
+  // Nothing interesting to return.
+  None
+}
+  }
+
+  /**
+   * Applies star-join filter.
+   *
+   * Given the outer/inner and the star/non-star sets,
+   * the following plan combinations are allowed:
+   * 1) (outer U inner) is a subset of star-join
+   * 2) star-join is a subset of (outer U inner)
+   * 3) (outer U inner) is a subset of non star-join
+   *
+   * It assumes the sets are disjoint.
+   *
+   * Example query graph:
+   *
+   * t1   d1 - t2 - t3
+   *  \  /
+   *   f1
+   *   |
+   *   d2
+   *
+   * star: {d1, f1, d2}
+   * non-star: {t2, t1, t3}
+   *
+   * level 0: (f1 ), (d2 ), (t3 ), (d1 ), (t1 ), (t2 )
+   * level 1: {t3 t2 }, {f1 d2 }, {f1 d1 }
+   * level 2: {d2 f1 d1 }
+   * level 3: {t1 d1 f1 d2 }, {t2 d1 f1 d2 }
+   * level 4: {d1 t2 f1 t1 d2 }, {d1 t3 t2 f1 d2 }
+   * level 5: {d1 t3 t2 f1 t1 d2 }
+   */
+  def starJoinFilter(
+  outer: Set[Int],
+  inner: Set[Int],
+  filters: JoinGraphInfo) : Boolean = {
+val starJoins = filters.starJoins
+val nonStarJoins = filters.nonStarJoins
+val join = outer.union(inner)
+
+// Disjoint sets
+outer.intersect(inner).isEmpty &&
+  // Either star or non-star is empty
+  (starJoins.isEmpty || nonStarJoins.isEmpty ||
+// Join is a subset of the star-join
+join.subsetOf(starJoins) ||
+// Star-join is a subset of join
+starJoins.subsetOf(join) ||
--- End diff --

@viirya The cost-based optimizer will find the best plan for the star-join. 
The star filter is a heuristic within join enumeration to limit the join 
sequences evaluated.
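
To make the filter concrete, here is a small self-contained re-implementation of the `starJoinFilter` conditions quoted above, including the third allowed combination from its doc comment (plans are represented by integer ids matching the example graph; this is an illustration, not the actual Spark implementation):

```scala
object StarFilterSketch {
  // star = {d1, f1, d2} -> ids {0, 1, 2}; non-star = {t1, t2, t3} -> ids {3, 4, 5}
  def starJoinFilter(
      outer: Set[Int],
      inner: Set[Int],
      starJoins: Set[Int],
      nonStarJoins: Set[Int]): Boolean = {
    val join = outer.union(inner)
    // Disjoint sets
    outer.intersect(inner).isEmpty &&
      // Either star or non-star is empty
      (starJoins.isEmpty || nonStarJoins.isEmpty ||
        // Join is a subset of the star-join
        join.subsetOf(starJoins) ||
        // Star-join is a subset of join
        starJoins.subsetOf(join) ||
        // Join is a subset of the non-star join
        join.subsetOf(nonStarJoins))
  }

  def main(args: Array[String]): Unit = {
    val star = Set(0, 1, 2)     // d1, f1, d2
    val nonStar = Set(3, 4, 5)  // t1, t2, t3
    println(starJoinFilter(Set(0), Set(1), star, nonStar))         // true: stays inside the star
    println(starJoinFilter(Set(3), Set(4), star, nonStar))         // true: stays inside the non-star set
    println(starJoinFilter(Set(0, 1), Set(3), star, nonStar))      // false: mixes the sets before the star is complete
    println(starJoinFilter(Set(0, 1, 2), Set(3), star, nonStar))   // true: the complete star may join non-star tables
  }
}
```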





[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-04-06 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/17222
  
I'll try and follow up this weekend.





[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

2017-04-06 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17494
  
Thanks @holdenk 





[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

2017-04-06 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/17494
  
LGTM as well





[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17222
  
**[Test build #75591 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75591/testReport)** for PR 17222 at commit [`4da2994`](https://github.com/apache/spark/commit/4da29941bdaef13fb94bd0d16e63cba8c8d197bc).





[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-04-06 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
@viirya Thanks for the careful review.





[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

2017-04-06 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17494
  
Thanks @jkbradley 





[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-04-06 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17222
  
LGTM, see if @marmbrus or @holdenk have any more comments about this change.





[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-04-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r110316824
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -436,6 +436,20 @@ def test_udf_with_order_by_and_limit(self):
         res.explain(True)
         self.assertEqual(res.collect(), [Row(id=0, copy=0)])
 
+    def test_non_existed_udf(self):
+        try:
+            self.spark.udf.registerJavaFunction("udf1", "non_existed_udf")
+            self.fail("should fail due to can not load java udf class")
+        except py4j.protocol.Py4JError as e:
+            self.assertTrue("Can not load class non_existed_udf" in str(e))
+
+    def test_non_existed_udaf(self):
+        try:
+            self.spark.udf.registerJavaUDAF("udf1", "non_existed_udaf")
--- End diff --

nit: udf1 -> udaf1.





[GitHub] spark issue #17558: [SPARK-20247][CORE] Add jar but this jar is missing late...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17558
  
**[Test build #75590 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75590/testReport)** for PR 17558 at commit [`de5b5fe`](https://github.com/apache/spark/commit/de5b5fe5942bdea0fbd0a98ee11fcca035dccaf0).





[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...

2017-04-06 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17546
  
This looks pretty good overall.





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110316465
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -736,6 +736,12 @@ object SQLConf {
       .checkValue(weight => weight >= 0 && weight <= 1, "The weight value must be in [0, 1].")
       .createWithDefault(0.7)
 
+  val JOIN_REORDER_DP_STAR_FILTER =
--- End diff --

So we can have this as true by default?





[GitHub] spark pull request #17558: [SPARK-20247][CORE] Add jar but this jar is missi...

2017-04-06 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/17558

[SPARK-20247][CORE] Add jar but this jar is missing later shouldn't affect 
jobs that doesn't use this jar

## What changes were proposed in this pull request?

Catch exception when jar is missing, as 
[SPARK-20247](https://issues.apache.org/jira/browse/SPARK-20247) described.

## How was this patch tested?
unit tests and manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-20247

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17558.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17558


commit de5b5fe5942bdea0fbd0a98ee11fcca035dccaf0
Author: Yuming Wang 
Date:   2017-04-07T04:51:01Z

Catch exception when jar is missing.







[GitHub] spark issue #17516: [SPARK-20197][SPARKR] CRAN check fail with package insta...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17516
  
**[Test build #75589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75589/testReport)** for PR 17516 at commit [`a3e8b35`](https://github.com/apache/spark/commit/a3e8b350c6ff6aff3b1537de64bfeda602d8aa11).





[GitHub] spark issue #17557: [SPARK-20208][WIP][R][DOCS] Document R fpGrowth support

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17557
  
**[Test build #75588 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75588/testReport)** for PR 17557 at commit [`27e94fd`](https://github.com/apache/spark/commit/27e94fd6732edc50762cf6bc7e17e900ea1ff313).





[GitHub] spark pull request #17557: [SPARK-20208][WIP][R][DOCS] Document R fpGrowth s...

2017-04-06 Thread zero323
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17557

[SPARK-20208][WIP][R][DOCS] Document R fpGrowth support

## What changes were proposed in this pull request?

Document  fpGrowth in:

- vignettes
- programming guide
- code example

## How was this patch tested?

TODO


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-20208

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17557.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17557


commit 94d0cf2fcb3474b5c7217d85ebfe81819bd1dc9e
Author: zero323 
Date:   2017-04-06T14:59:14Z

List FP-growth among available algorithms

commit 27e94fd6732edc50762cf6bc7e17e900ea1ff313
Author: zero323 
Date:   2017-04-06T15:38:18Z

Add basic description







[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-04-06 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15770
  
Any update on this?





[GitHub] spark pull request #17553: [SPARK-20026][Doc] Add Tweedie example for SparkR...

2017-04-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/17553#discussion_r110315204
  
--- Diff: examples/src/main/r/ml/glm.R ---
@@ -56,6 +56,15 @@ summary(binomialGLM)
 # Prediction
 binomialPredictions <- predict(binomialGLM, binomialTestDF)
 head(binomialPredictions)
+
+# Fit a generalized linear model of family "tweedie" with spark.glm
+training3 <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
+tweedieDF <- transform(training3, label= training3$label * exp(randn(10)))
--- End diff --

nit, style: `label = trai...`





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread ioana-delaney
Github user ioana-delaney commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110314839
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala
 ---
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap}
+import org.apache.spark.sql.catalyst.plans.{Inner, PlanTest}
+import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import org.apache.spark.sql.catalyst.statsEstimation.{StatsEstimationTestBase, StatsTestPlan}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf._
+
+
+class StarJoinCostBasedReorderSuite extends PlanTest with StatsEstimationTestBase {
+
+  override val conf = new SQLConf().copy(
+CASE_SENSITIVE -> true,
+CBO_ENABLED -> true,
+JOIN_REORDER_ENABLED -> true,
+STARSCHEMA_DETECTION -> true,
+JOIN_REORDER_DP_STAR_FILTER -> true)
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+val batches =
+  Batch("Operator Optimizations", FixedPoint(100),
+CombineFilters,
+PushDownPredicate,
+ReorderJoin(conf),
+PushPredicateThroughJoin,
+ColumnPruning,
+CollapseProject) ::
+Batch("Join Reorder", Once,
+  CostBasedJoinReorder(conf)) :: Nil
+  }
+
+  private val columnInfo: AttributeMap[ColumnStat] = AttributeMap(Seq(
+    // F1 (fact table)
+    attr("f1_fk1") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("f1_fk2") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("f1_fk3") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("f1_c1") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("f1_c2") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+
+    // D1 (dimension)
+    attr("d1_pk") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("d1_c2") -> ColumnStat(distinctCount = 50, min = Some(1), max = Some(50),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("d1_c3") -> ColumnStat(distinctCount = 50, min = Some(1), max = Some(50),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+
+    // D2 (dimension)
+    attr("d2_pk") -> ColumnStat(distinctCount = 20, min = Some(1), max = Some(20),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("d2_c2") -> ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("d2_c3") -> ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+
+    // D3 (dimension)
+    attr("d3_pk") -> ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("d3_c2") -> ColumnStat(distinctCount = 5, min = Some(1), max = Some(5),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("d3_c3") -> ColumnStat(distinctCount = 5, min = Some(1), max = Some(5),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+
+    // T1 (regular table i.e. outside star)
+    attr("t1_c1") -> ColumnStat(distinctCount = 20, min = Some(1), max = Some(20),
+      nullCount = 1, avgLen = 4, maxLen = 4),
  

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17556
  
Can one of the admins verify this patch?





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread ioana-delaney
Github user ioana-delaney commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110314588
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala
 ---
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap}
+import org.apache.spark.sql.catalyst.plans.{Inner, PlanTest}
+import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import org.apache.spark.sql.catalyst.statsEstimation.{StatsEstimationTestBase, StatsTestPlan}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf._
+
+
+class StarJoinCostBasedReorderSuite extends PlanTest with StatsEstimationTestBase {
+
+  override val conf = new SQLConf().copy(
+CASE_SENSITIVE -> true,
+CBO_ENABLED -> true,
+JOIN_REORDER_ENABLED -> true,
+STARSCHEMA_DETECTION -> true,
+JOIN_REORDER_DP_STAR_FILTER -> true)
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+val batches =
+  Batch("Operator Optimizations", FixedPoint(100),
+CombineFilters,
+PushDownPredicate,
+ReorderJoin(conf),
+PushPredicateThroughJoin,
+ColumnPruning,
+CollapseProject) ::
+Batch("Join Reorder", Once,
+  CostBasedJoinReorder(conf)) :: Nil
+  }
+
+  private val columnInfo: AttributeMap[ColumnStat] = AttributeMap(Seq(
+    // F1 (fact table)
+    attr("f1_fk1") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("f1_fk2") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("f1_fk3") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("f1_c1") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("f1_c2") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+
+    // D1 (dimension)
+    attr("d1_pk") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("d1_c2") -> ColumnStat(distinctCount = 50, min = Some(1), max = Some(50),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("d1_c3") -> ColumnStat(distinctCount = 50, min = Some(1), max = Some(50),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+
+    // D2 (dimension)
+    attr("d2_pk") -> ColumnStat(distinctCount = 20, min = Some(1), max = Some(20),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("d2_c2") -> ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("d2_c3") -> ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+
+    // D3 (dimension)
+    attr("d3_pk") -> ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("d3_c2") -> ColumnStat(distinctCount = 5, min = Some(1), max = Some(5),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+    attr("d3_c3") -> ColumnStat(distinctCount = 5, min = Some(1), max = Some(5),
+      nullCount = 0, avgLen = 4, maxLen = 4),
+
+    // T1 (regular table i.e. outside star)
+    attr("t1_c1") -> ColumnStat(distinctCount = 20, min = Some(1), max = Some(20),
+      nullCount = 1, avgLen = 4, maxLen = 4),
  

[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-06 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/17556

[SPARK-16957][MLlib] Use weighted midpoints for split values.

## What changes were proposed in this pull request?

Use weighted midpoints for split values.

## How was this patch tested?

+ [x] add unit test.
+ [x] modify Split's unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark 
ENH/decision_tree_overflow_and_precision_in_aggregation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17556.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17556


commit 45b74930eea787411855fc35a7ad7198b35d577e
Author: 颜发才(Yan Facai) 
Date:   2017-04-07T04:02:13Z

TST: add test case

commit c49d3ae7db0e66855b0c896375b11bf51d9ac482
Author: 颜发才(Yan Facai) 
Date:   2017-04-07T04:05:36Z

ENH: use weighted midpoints

commit 387eb498054289149706ecd2f88593d008fd074f
Author: 颜发才(Yan Facai) 
Date:   2017-04-07T04:13:44Z

BUG: constant feature, outOfIndex

commit 2e68f1efca59772d1e905474c2392ad0d8b413c8
Author: 颜发才(Yan Facai) 
Date:   2017-04-07T04:15:09Z

TST: modify split's test case

commit 6a5806f35185596ffda2c88c4879ecaf0be3bda1
Author: 颜发才(Yan Facai) 
Date:   2017-04-07T04:24:02Z

CLN: move test case







[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread ioana-delaney
Github user ioana-delaney commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110314369
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -736,6 +736,12 @@ object SQLConf {
       .checkValue(weight => weight >= 0 && weight <= 1, "The weight value must be in [0, 1].")
       .createWithDefault(0.7)
 
+  val JOIN_REORDER_DP_STAR_FILTER =
--- End diff --

@viirya Star join plans are expected to have an optimal execution based on 
their referential integrity constraints among the tables. It is a good 
heuristic. I expect that once CBO is enabled by default, star joins will also 
be enabled.






[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110313675
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -54,14 +54,12 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr
 
   private def reorder(plan: LogicalPlan, output: Seq[Attribute]): LogicalPlan = {
     val (items, conditions) = extractInnerJoins(plan)
-    // TODO: Compute the set of star-joins and use them in the join enumeration
-    // algorithm to prune un-optimal plan choices.
     val result =
       // Do reordering if the number of items is appropriate and join conditions exist.
       // We also need to check if costs of all items can be evaluated.
       if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && conditions.nonEmpty &&
           items.forall(_.stats(conf).rowCount.isDefined)) {
-        JoinReorderDP.search(conf, items, conditions, output)
+        JoinReorderDP(conf).search(conf, items, conditions, output)
--- End diff --

Revert it back?





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110313661
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -134,7 +132,7 @@ case class CostBasedJoinReorder(conf: SQLConf) extends 
Rule[LogicalPlan] with Pr
  * For cost evaluation, since physical costs for operators are not 
available currently, we use
  * cardinalities and sizes to compute costs.
  */
-object JoinReorderDP extends PredicateHelper with Logging {
+case class JoinReorderDP(conf: SQLConf) extends PredicateHelper with 
Logging {
--- End diff --

Revert it back?





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110313369
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -736,6 +736,12 @@ object SQLConf {
   .checkValue(weight => weight >= 0 && weight <= 1, "The weight value 
must be in [0, 1].")
   .createWithDefault(0.7)
 
+  val JOIN_REORDER_DP_STAR_FILTER =
+buildConf("spark.sql.cbo.joinReorder.dp.star.filter")
+  .doc("Applies star-join filter heuristics to cost based join 
enumeration.")
+  .booleanConf
+  .createWithDefault(false)
--- End diff --

cc @wzhfy @ron8hu @sameeragarwal @cloud-fan 





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110313349
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -736,6 +736,12 @@ object SQLConf {
   .checkValue(weight => weight >= 0 && weight <= 1, "The weight value 
must be in [0, 1].")
   .createWithDefault(0.7)
 
+  val JOIN_REORDER_DP_STAR_FILTER =
+buildConf("spark.sql.cbo.joinReorder.dp.star.filter")
+  .doc("Applies star-join filter heuristics to cost based join 
enumeration.")
+  .booleanConf
+  .createWithDefault(false)
--- End diff --

This logic will be enabled if and only if both `conf.cboEnabled` and
`conf.joinReorderEnabled` are true. Thus, is it safe to make it `true` by default?
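
A minimal sketch of the gating described here, under the assumption that the
PR adds an accessor next to `cboEnabled` and `joinReorderEnabled`
(`joinReorderDPStarFilter` below is that hypothetical accessor):

```scala
import org.apache.spark.sql.internal.SQLConf

// Sketch only: with this guard, a `true` default for the star filter is inert
// unless CBO and join reordering are both enabled.
def starFilterEffective(conf: SQLConf): Boolean =
  conf.cboEnabled && conf.joinReorderEnabled && conf.joinReorderDPStarFilter
```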





[GitHub] spark pull request #17552: [SPARK-20245][SQL][minor] pass output to LogicalR...

2017-04-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17552#discussion_r110312633
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala
 ---
@@ -18,39 +18,21 @@ package org.apache.spark.sql.execution.datasources
 
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
 import org.apache.spark.sql.catalyst.catalog.CatalogTable
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference}
+import org.apache.spark.sql.catalyst.expressions.{AttributeMap, 
AttributeReference}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, 
Statistics}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.sources.BaseRelation
 import org.apache.spark.util.Utils
 
 /**
  * Used to link a [[BaseRelation]] in to a logical query plan.
- *
- * Note that sometimes we need to use `LogicalRelation` to replace an 
existing leaf node without
- * changing the output attributes' IDs.  The `expectedOutputAttributes` 
parameter is used for
- * this purpose.  See https://issues.apache.org/jira/browse/SPARK-10741 
for more details.
  */
 case class LogicalRelation(
 relation: BaseRelation,
-expectedOutputAttributes: Option[Seq[Attribute]] = None,
-catalogTable: Option[CatalogTable] = None)
+output: Seq[AttributeReference],
+catalogTable: Option[CatalogTable])
   extends LeafNode with MultiInstanceRelation {
 
-  override val output: Seq[AttributeReference] = {
-val attrs = relation.schema.toAttributes
-expectedOutputAttributes.map { expectedAttrs =>
-  assert(expectedAttrs.length == attrs.length)
-  attrs.zip(expectedAttrs).map {
-// We should respect the attribute names provided by base relation 
and only use the
-// exprId in `expectedOutputAttributes`.
-// The reason is that, some relations(like parquet) will reconcile 
attribute names to
-// workaround case insensitivity issue.
-case (attr, expected) => attr.withExprId(expected.exprId)
--- End diff --

Agree.





[GitHub] spark issue #17552: [SPARK-20245][SQL][minor] pass output to LogicalRelation...

2017-04-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17552
  
LGTM pending Jenkins. 





[GitHub] spark issue #17552: [SPARK-20245][SQL][minor] pass output to LogicalRelation...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17552
  
**[Test build #75587 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75587/testReport)**
 for PR 17552 at commit 
[`0fbd4a6`](https://github.com/apache/spark/commit/0fbd4a65f4c8242626fb35029cb22ce502dc696f).





[GitHub] spark pull request #17552: [SPARK-20245][SQL][minor] pass output to LogicalR...

2017-04-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17552#discussion_r110311641
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala
 ---
@@ -18,39 +18,21 @@ package org.apache.spark.sql.execution.datasources
 
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
 import org.apache.spark.sql.catalyst.catalog.CatalogTable
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference}
+import org.apache.spark.sql.catalyst.expressions.{AttributeMap, 
AttributeReference}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, 
Statistics}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.sources.BaseRelation
 import org.apache.spark.util.Utils
 
 /**
  * Used to link a [[BaseRelation]] in to a logical query plan.
- *
- * Note that sometimes we need to use `LogicalRelation` to replace an 
existing leaf node without
- * changing the output attributes' IDs.  The `expectedOutputAttributes` 
parameter is used for
- * this purpose.  See https://issues.apache.org/jira/browse/SPARK-10741 
for more details.
  */
 case class LogicalRelation(
 relation: BaseRelation,
-expectedOutputAttributes: Option[Seq[Attribute]] = None,
-catalogTable: Option[CatalogTable] = None)
+output: Seq[AttributeReference],
+catalogTable: Option[CatalogTable])
   extends LeafNode with MultiInstanceRelation {
 
-  override val output: Seq[AttributeReference] = {
-val attrs = relation.schema.toAttributes
-expectedOutputAttributes.map { expectedAttrs =>
-  assert(expectedAttrs.length == attrs.length)
-  attrs.zip(expectedAttrs).map {
-// We should respect the attribute names provided by base relation 
and only use the
-// exprId in `expectedOutputAttributes`.
-// The reason is that, some relations(like parquet) will reconcile 
attribute names to
-// workaround case insensitivity issue.
-case (attr, expected) => attr.withExprId(expected.exprId)
--- End diff --

Good catch! I found that this logic is only useful when converting Hive tables
to data source tables, so I moved it there: 
https://github.com/apache/spark/pull/17552/files#diff-ee66e11b56c21364760a5ed2b783f863R215
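
For reference, a minimal sketch (a hypothetical helper, not the PR's exact
code) of the exprId reconciliation being moved into the Hive-to-data-source
conversion path:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}

// Keep the relation's attribute names (Parquet may have reconciled them to work
// around case insensitivity) but reuse the exprIds of the node being replaced,
// so existing references to the old output stay valid.
def reconcileExprIds(
    relationAttrs: Seq[AttributeReference],
    expectedAttrs: Seq[Attribute]): Seq[AttributeReference] = {
  require(relationAttrs.length == expectedAttrs.length)
  relationAttrs.zip(expectedAttrs).map {
    case (attr, expected) => attr.withExprId(expected.exprId)
  }
}
```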





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110309359
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala
 ---
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap}
+import org.apache.spark.sql.catalyst.plans.{Inner, PlanTest}
+import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, 
LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import 
org.apache.spark.sql.catalyst.statsEstimation.{StatsEstimationTestBase, 
StatsTestPlan}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf._
+
+
+class StarJoinCostBasedReorderSuite extends PlanTest with 
StatsEstimationTestBase {
+
+  override val conf = new SQLConf().copy(
+CASE_SENSITIVE -> true,
+CBO_ENABLED -> true,
+JOIN_REORDER_ENABLED -> true,
+STARSCHEMA_DETECTION -> true,
+JOIN_REORDER_DP_STAR_FILTER -> true)
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+val batches =
+  Batch("Operator Optimizations", FixedPoint(100),
+CombineFilters,
+PushDownPredicate,
+ReorderJoin(conf),
+PushPredicateThroughJoin,
+ColumnPruning,
+CollapseProject) ::
+Batch("Join Reorder", Once,
+  CostBasedJoinReorder(conf)) :: Nil
+  }
+
+  private val columnInfo: AttributeMap[ColumnStat] = AttributeMap(Seq(
+// F1 (fact table)
+attr("f1_fk1") -> ColumnStat(distinctCount = 100, min = Some(1), max = 
Some(100),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("f1_fk2") -> ColumnStat(distinctCount = 100, min = Some(1), max = 
Some(100),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("f1_fk3") -> ColumnStat(distinctCount = 100, min = Some(1), max = 
Some(100),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("f1_c1") -> ColumnStat(distinctCount = 100, min = Some(1), max = 
Some(100),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("f1_c2") -> ColumnStat(distinctCount = 100, min = Some(1), max = 
Some(100),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+
+// D1 (dimension)
+attr("d1_pk") -> ColumnStat(distinctCount = 100, min = Some(1), max = 
Some(100),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("d1_c2") -> ColumnStat(distinctCount = 50, min = Some(1), max = 
Some(50),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("d1_c3") -> ColumnStat(distinctCount = 50, min = Some(1), max = 
Some(50),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+
+// D2 (dimension)
+attr("d2_pk") -> ColumnStat(distinctCount = 20, min = Some(1), max = 
Some(20),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("d2_c2") -> ColumnStat(distinctCount = 10, min = Some(1), max = 
Some(10),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("d2_c3") -> ColumnStat(distinctCount = 10, min = Some(1), max = 
Some(10),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+
+// D3 (dimension)
+attr("d3_pk") -> ColumnStat(distinctCount = 10, min = Some(1), max = 
Some(10),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("d3_c2") -> ColumnStat(distinctCount = 5, min = Some(1), max = 
Some(5),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("d3_c3") -> ColumnStat(distinctCount = 5, min = Some(1), max = 
Some(5),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+
+// T1 (regular table i.e. outside star)
+attr("t1_c1") -> ColumnStat(distinctCount = 20, min = Some(1), max = 
Some(20),
+  nullCount = 1, avgLen = 4, maxLen = 4),
+

[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110309073
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala
 ---
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap}
+import org.apache.spark.sql.catalyst.plans.{Inner, PlanTest}
+import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, 
LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import 
org.apache.spark.sql.catalyst.statsEstimation.{StatsEstimationTestBase, 
StatsTestPlan}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf._
+
+
+class StarJoinCostBasedReorderSuite extends PlanTest with 
StatsEstimationTestBase {
+
+  override val conf = new SQLConf().copy(
+CASE_SENSITIVE -> true,
+CBO_ENABLED -> true,
+JOIN_REORDER_ENABLED -> true,
+STARSCHEMA_DETECTION -> true,
+JOIN_REORDER_DP_STAR_FILTER -> true)
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+val batches =
+  Batch("Operator Optimizations", FixedPoint(100),
+CombineFilters,
+PushDownPredicate,
+ReorderJoin(conf),
+PushPredicateThroughJoin,
+ColumnPruning,
+CollapseProject) ::
+Batch("Join Reorder", Once,
+  CostBasedJoinReorder(conf)) :: Nil
+  }
+
+  private val columnInfo: AttributeMap[ColumnStat] = AttributeMap(Seq(
+// F1 (fact table)
+attr("f1_fk1") -> ColumnStat(distinctCount = 100, min = Some(1), max = 
Some(100),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("f1_fk2") -> ColumnStat(distinctCount = 100, min = Some(1), max = 
Some(100),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("f1_fk3") -> ColumnStat(distinctCount = 100, min = Some(1), max = 
Some(100),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("f1_c1") -> ColumnStat(distinctCount = 100, min = Some(1), max = 
Some(100),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("f1_c2") -> ColumnStat(distinctCount = 100, min = Some(1), max = 
Some(100),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+
+// D1 (dimension)
+attr("d1_pk") -> ColumnStat(distinctCount = 100, min = Some(1), max = 
Some(100),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("d1_c2") -> ColumnStat(distinctCount = 50, min = Some(1), max = 
Some(50),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("d1_c3") -> ColumnStat(distinctCount = 50, min = Some(1), max = 
Some(50),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+
+// D2 (dimension)
+attr("d2_pk") -> ColumnStat(distinctCount = 20, min = Some(1), max = 
Some(20),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("d2_c2") -> ColumnStat(distinctCount = 10, min = Some(1), max = 
Some(10),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("d2_c3") -> ColumnStat(distinctCount = 10, min = Some(1), max = 
Some(10),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+
+// D3 (dimension)
+attr("d3_pk") -> ColumnStat(distinctCount = 10, min = Some(1), max = 
Some(10),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("d3_c2") -> ColumnStat(distinctCount = 5, min = Some(1), max = 
Some(5),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+attr("d3_c3") -> ColumnStat(distinctCount = 5, min = Some(1), max = 
Some(5),
+  nullCount = 0, avgLen = 4, maxLen = 4),
+
+// T1 (regular table i.e. outside star)
+attr("t1_c1") -> ColumnStat(distinctCount = 20, min = Some(1), max = 
Some(20),
+  nullCount = 1, avgLen = 4, maxLen = 4),
+

[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110308327
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -736,6 +736,12 @@ object SQLConf {
   .checkValue(weight => weight >= 0 && weight <= 1, "The weight value 
must be in [0, 1].")
   .createWithDefault(0.7)
 
+  val JOIN_REORDER_DP_STAR_FILTER =
--- End diff --

Are there any cases where we wouldn't want to enable this if CBO is enabled?





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110307898
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with 
Logging {
 case class Cost(card: BigInt, size: BigInt) {
   def +(other: Cost): Cost = Cost(this.card + other.card, this.size + 
other.size)
 }
+
+/**
+ * Implements optional filters to reduce the search space for join 
enumeration.
+ *
+ * 1) Star-join filters: Plan star-joins together since they are assumed
+ *to have an optimal execution based on their RI relationship.
+ * 2) Cartesian products: Defer their planning later in the graph to avoid
+ *large intermediate results (expanding joins, in general).
+ * 3) Composite inners: Don't generate "bushy tree" plans to avoid 
materializing
+ *   intermediate results.
+ *
+ * Filters (2) and (3) are not implemented.
+ */
+case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper {
+  /**
+   * Builds join graph information to be used by the filtering strategies.
+   * Currently, it builds the sets of star/non-star joins.
+   * It can be extended with the sets of connected/unconnected joins, which
+   * can be used to filter Cartesian products.
+   */
+  def buildJoinGraphInfo(
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = {
+
+// Compute the tables in a star-schema relationship.
+val starJoin = StarSchemaDetection(conf).findStarJoins(items, 
conditions.toSeq)
+val nonStarJoin = items.filterNot(starJoin.contains(_))
+
+if (starJoin.nonEmpty && nonStarJoin.nonEmpty) {
+  val (starInt, nonStarInt) = planIndex.collect {
+case (p, i) if starJoin.contains(p) =>
+  (Some(i), None)
+case (p, i) if nonStarJoin.contains(p) =>
+  (None, Some(i))
+case _ =>
+  (None, None)
+  }.unzip
+  Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet))
+} else {
+  // Nothing interesting to return.
+  None
+}
+  }
+
+  /**
+   * Applies star-join filter.
+   *
+   * Given the outer/inner and the star/non-star sets,
+   * the following plan combinations are allowed:
+   * 1) (outer U inner) is a subset of star-join
+   * 2) star-join is a subset of (outer U inner)
+   * 3) (outer U inner) is a subset of non star-join
+   *
+   * It assumes the sets are disjoint.
+   *
+   * Example query graph:
+   *
+   * t1   d1 - t2 - t3
+   *  \  /
+   *   f1
+   *   |
+   *   d2
+   *
+   * star: {d1, f1, d2}
+   * non-star: {t2, t1, t3}
+   *
+   * level 0: (f1 ), (d2 ), (t3 ), (d1 ), (t1 ), (t2 )
+   * level 1: {t3 t2 }, {f1 d2 }, {f1 d1 }
+   * level 2: {d2 f1 d1 }
+   * level 3: {t1 d1 f1 d2 }, {t2 d1 f1 d2 }
+   * level 4: {d1 t2 f1 t1 d2 }, {d1 t3 t2 f1 d2 }
+   * level 5: {d1 t3 t2 f1 t1 d2 }
+   */
+  def starJoinFilter(
+  outer: Set[Int],
+  inner: Set[Int],
+  filters: JoinGraphInfo) : Boolean = {
+val starJoins = filters.starJoins
+val nonStarJoins = filters.nonStarJoins
+val join = outer.union(inner)
+
+// Disjoint sets
+outer.intersect(inner).isEmpty &&
--- End diff --

ok for me.





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110307786
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with 
Logging {
 case class Cost(card: BigInt, size: BigInt) {
   def +(other: Cost): Cost = Cost(this.card + other.card, this.size + 
other.size)
 }
+
+/**
+ * Implements optional filters to reduce the search space for join 
enumeration.
+ *
+ * 1) Star-join filters: Plan star-joins together since they are assumed
+ *to have an optimal execution based on their RI relationship.
+ * 2) Cartesian products: Defer their planning later in the graph to avoid
+ *large intermediate results (expanding joins, in general).
+ * 3) Composite inners: Don't generate "bushy tree" plans to avoid 
materializing
+ *   intermediate results.
+ *
+ * Filters (2) and (3) are not implemented.
+ */
+case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper {
+  /**
+   * Builds join graph information to be used by the filtering strategies.
+   * Currently, it builds the sets of star/non-star joins.
+   * It can be extended with the sets of connected/unconnected joins, which
+   * can be used to filter Cartesian products.
+   */
+  def buildJoinGraphInfo(
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = {
+
+// Compute the tables in a star-schema relationship.
+val starJoin = StarSchemaDetection(conf).findStarJoins(items, 
conditions.toSeq)
+val nonStarJoin = items.filterNot(starJoin.contains(_))
+
+if (starJoin.nonEmpty && nonStarJoin.nonEmpty) {
+  val (starInt, nonStarInt) = planIndex.collect {
+case (p, i) if starJoin.contains(p) =>
+  (Some(i), None)
+case (p, i) if nonStarJoin.contains(p) =>
+  (None, Some(i))
+case _ =>
+  (None, None)
+  }.unzip
+  Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet))
+} else {
+  // Nothing interesting to return.
+  None
+}
+  }
+
+  /**
+   * Applies star-join filter.
+   *
+   * Given the outer/inner and the star/non-star sets,
+   * the following plan combinations are allowed:
+   * 1) (outer U inner) is a subset of star-join
+   * 2) star-join is a subset of (outer U inner)
+   * 3) (outer U inner) is a subset of non star-join
+   *
+   * It assumes the sets are disjoint.
+   *
+   * Example query graph:
+   *
+   * t1   d1 - t2 - t3
+   *  \  /
+   *   f1
+   *   |
+   *   d2
+   *
+   * star: {d1, f1, d2}
+   * non-star: {t2, t1, t3}
+   *
+   * level 0: (f1 ), (d2 ), (t3 ), (d1 ), (t1 ), (t2 )
+   * level 1: {t3 t2 }, {f1 d2 }, {f1 d1 }
+   * level 2: {d2 f1 d1 }
+   * level 3: {t1 d1 f1 d2 }, {t2 d1 f1 d2 }
+   * level 4: {d1 t2 f1 t1 d2 }, {d1 t3 t2 f1 d2 }
+   * level 5: {d1 t3 t2 f1 t1 d2 }
+   */
+  def starJoinFilter(
+  outer: Set[Int],
+  inner: Set[Int],
+  filters: JoinGraphInfo) : Boolean = {
+val starJoins = filters.starJoins
+val nonStarJoins = filters.nonStarJoins
+val join = outer.union(inner)
+
+// Disjoint sets
+outer.intersect(inner).isEmpty &&
+  // Either star or non-star is empty
+  (starJoins.isEmpty || nonStarJoins.isEmpty ||
+// Join is a subset of the star-join
+join.subsetOf(starJoins) ||
+// Star-join is a subset of join
+starJoins.subsetOf(join) ||
--- End diff --

oh. right. forget that.





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110307666
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with 
Logging {
 case class Cost(card: BigInt, size: BigInt) {
   def +(other: Cost): Cost = Cost(this.card + other.card, this.size + 
other.size)
 }
+
+/**
+ * Implements optional filters to reduce the search space for join 
enumeration.
+ *
+ * 1) Star-join filters: Plan star-joins together since they are assumed
+ *to have an optimal execution based on their RI relationship.
+ * 2) Cartesian products: Defer their planning later in the graph to avoid
+ *large intermediate results (expanding joins, in general).
+ * 3) Composite inners: Don't generate "bushy tree" plans to avoid 
materializing
+ *   intermediate results.
+ *
+ * Filters (2) and (3) are not implemented.
+ */
+case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper {
+  /**
+   * Builds join graph information to be used by the filtering strategies.
+   * Currently, it builds the sets of star/non-star joins.
+   * It can be extended with the sets of connected/unconnected joins, which
+   * can be used to filter Cartesian products.
+   */
+  def buildJoinGraphInfo(
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = {
+
+// Compute the tables in a star-schema relationship.
+val starJoin = StarSchemaDetection(conf).findStarJoins(items, 
conditions.toSeq)
+val nonStarJoin = items.filterNot(starJoin.contains(_))
+
+if (starJoin.nonEmpty && nonStarJoin.nonEmpty) {
+  val (starInt, nonStarInt) = planIndex.collect {
+case (p, i) if starJoin.contains(p) =>
+  (Some(i), None)
+case (p, i) if nonStarJoin.contains(p) =>
+  (None, Some(i))
+case _ =>
+  (None, None)
+  }.unzip
+  Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet))
+} else {
+  // Nothing interesting to return.
+  None
+}
+  }
+
+  /**
+   * Applies star-join filter.
+   *
+   * Given the outer/inner and the star/non-star sets,
+   * the following plan combinations are allowed:
+   * 1) (outer U inner) is a subset of star-join
+   * 2) star-join is a subset of (outer U inner)
+   * 3) (outer U inner) is a subset of non star-join
+   *
+   * It assumes the sets are disjoint.
+   *
+   * Example query graph:
+   *
+   * t1   d1 - t2 - t3
+   *  \  /
+   *   f1
+   *   |
+   *   d2
+   *
+   * star: {d1, f1, d2}
+   * non-star: {t2, t1, t3}
+   *
+   * level 0: (f1 ), (d2 ), (t3 ), (d1 ), (t1 ), (t2 )
+   * level 1: {t3 t2 }, {f1 d2 }, {f1 d1 }
+   * level 2: {d2 f1 d1 }
+   * level 3: {t1 d1 f1 d2 }, {t2 d1 f1 d2 }
+   * level 4: {d1 t2 f1 t1 d2 }, {d1 t3 t2 f1 d2 }
+   * level 5: {d1 t3 t2 f1 t1 d2 }
+   */
+  def starJoinFilter(
+  outer: Set[Int],
+  inner: Set[Int],
+  filters: JoinGraphInfo) : Boolean = {
+val starJoins = filters.starJoins
+val nonStarJoins = filters.nonStarJoins
+val join = outer.union(inner)
+
+// Disjoint sets
+outer.intersect(inner).isEmpty &&
+  // Either star or non-star is empty
+  (starJoins.isEmpty || nonStarJoins.isEmpty ||
+// Join is a subset of the star-join
+join.subsetOf(starJoins) ||
+// Star-join is a subset of join
+starJoins.subsetOf(join) ||
--- End diff --

`ReorderJoin` reorders joins heuristically. It can still be useful when CBO is off.





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110306486
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with 
Logging {
 case class Cost(card: BigInt, size: BigInt) {
   def +(other: Cost): Cost = Cost(this.card + other.card, this.size + 
other.size)
 }
+
+/**
+ * Implements optional filters to reduce the search space for join 
enumeration.
+ *
+ * 1) Star-join filters: Plan star-joins together since they are assumed
+ *to have an optimal execution based on their RI relationship.
+ * 2) Cartesian products: Defer their planning later in the graph to avoid
+ *large intermediate results (expanding joins, in general).
+ * 3) Composite inners: Don't generate "bushy tree" plans to avoid 
materializing
+ *   intermediate results.
+ *
+ * Filters (2) and (3) are not implemented.
+ */
+case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper {
+  /**
+   * Builds join graph information to be used by the filtering strategies.
+   * Currently, it builds the sets of star/non-star joins.
+   * It can be extended with the sets of connected/unconnected joins, which
+   * can be used to filter Cartesian products.
+   */
+  def buildJoinGraphInfo(
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = {
+
+// Compute the tables in a star-schema relationship.
+val starJoin = StarSchemaDetection(conf).findStarJoins(items, 
conditions.toSeq)
+val nonStarJoin = items.filterNot(starJoin.contains(_))
+
+if (starJoin.nonEmpty && nonStarJoin.nonEmpty) {
+  val (starInt, nonStarInt) = planIndex.collect {
+case (p, i) if starJoin.contains(p) =>
+  (Some(i), None)
+case (p, i) if nonStarJoin.contains(p) =>
+  (None, Some(i))
+case _ =>
+  (None, None)
+  }.unzip
+  Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet))
+} else {
+  // Nothing interesting to return.
+  None
+}
+  }
+
+  /**
+   * Applies star-join filter.
+   *
+   * Given the outer/inner and the star/non-star sets,
+   * the following plan combinations are allowed:
+   * 1) (outer U inner) is a subset of star-join
+   * 2) star-join is a subset of (outer U inner)
+   * 3) (outer U inner) is a subset of non star-join
+   *
+   * It assumes the sets are disjoint.
+   *
+   * Example query graph:
+   *
+   * t1   d1 - t2 - t3
+   *  \  /
+   *   f1
+   *   |
+   *   d2
+   *
+   * star: {d1, f1, d2}
+   * non-star: {t2, t1, t3}
+   *
+   * level 0: (f1 ), (d2 ), (t3 ), (d1 ), (t1 ), (t2 )
+   * level 1: {t3 t2 }, {f1 d2 }, {f1 d1 }
+   * level 2: {d2 f1 d1 }
+   * level 3: {t1 d1 f1 d2 }, {t2 d1 f1 d2 }
+   * level 4: {d1 t2 f1 t1 d2 }, {d1 t3 t2 f1 d2 }
+   * level 5: {d1 t3 t2 f1 t1 d2 }
+   */
+  def starJoinFilter(
+  outer: Set[Int],
+  inner: Set[Int],
+  filters: JoinGraphInfo) : Boolean = {
+val starJoins = filters.starJoins
+val nonStarJoins = filters.nonStarJoins
+val join = outer.union(inner)
+
+// Disjoint sets
+outer.intersect(inner).isEmpty &&
+  // Either star or non-star is empty
+  (starJoins.isEmpty || nonStarJoins.isEmpty ||
+// Join is a subset of the star-join
+join.subsetOf(starJoins) ||
+// Star-join is a subset of join
+starJoins.subsetOf(join) ||
--- End diff --

So do we still need `ReorderJoin`? Looks like we don't need it anymore if 
we don't care about the order created by it.





[GitHub] spark pull request #17555: [SPARK-19495][SQL] Make SQLConf slightly more ext...

2017-04-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17555





[GitHub] spark issue #17555: [SPARK-19495][SQL] Make SQLConf slightly more extensible...

2017-04-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17555
  
Thanks! Merging to master.





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110305903
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with 
Logging {
 case class Cost(card: BigInt, size: BigInt) {
   def +(other: Cost): Cost = Cost(this.card + other.card, this.size + 
other.size)
 }
+
+/**
+ * Implements optional filters to reduce the search space for join 
enumeration.
+ *
+ * 1) Star-join filters: Plan star-joins together since they are assumed
+ *to have an optimal execution based on their RI relationship.
+ * 2) Cartesian products: Defer their planning later in the graph to avoid
+ *large intermediate results (expanding joins, in general).
+ * 3) Composite inners: Don't generate "bushy tree" plans to avoid 
materializing
+ *   intermediate results.
+ *
+ * Filters (2) and (3) are not implemented.
+ */
+case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper {
+  /**
+   * Builds join graph information to be used by the filtering strategies.
+   * Currently, it builds the sets of star/non-star joins.
+   * It can be extended with the sets of connected/unconnected joins, which
+   * can be used to filter Cartesian products.
+   */
+  def buildJoinGraphInfo(
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = {
+
+// Compute the tables in a star-schema relationship.
+val starJoin = StarSchemaDetection(conf).findStarJoins(items, 
conditions.toSeq)
+val nonStarJoin = items.filterNot(starJoin.contains(_))
+
+if (starJoin.nonEmpty && nonStarJoin.nonEmpty) {
+  val (starInt, nonStarInt) = planIndex.collect {
+case (p, i) if starJoin.contains(p) =>
+  (Some(i), None)
+case (p, i) if nonStarJoin.contains(p) =>
+  (None, Some(i))
+case _ =>
+  (None, None)
+  }.unzip
+  Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet))
+} else {
+  // Nothing interesting to return.
+  None
+}
+  }
+
+  /**
+   * Applies star-join filter.
+   *
+   * Given the outer/inner and the star/non-star sets,
+   * the following plan combinations are allowed:
+   * 1) (outer U inner) is a subset of star-join
+   * 2) star-join is a subset of (outer U inner)
+   * 3) (outer U inner) is a subset of non star-join
+   *
+   * It assumes the sets are disjoint.
+   *
+   * Example query graph:
+   *
+   * t1   d1 - t2 - t3
+   *  \  /
+   *   f1
+   *   |
+   *   d2
+   *
+   * star: {d1, f1, d2}
+   * non-star: {t2, t1, t3}
+   *
+   * level 0: (f1 ), (d2 ), (t3 ), (d1 ), (t1 ), (t2 )
+   * level 1: {t3 t2 }, {f1 d2 }, {f1 d1 }
+   * level 2: {d2 f1 d1 }
+   * level 3: {t1 d1 f1 d2 }, {t2 d1 f1 d2 }
+   * level 4: {d1 t2 f1 t1 d2 }, {d1 t3 t2 f1 d2 }
+   * level 5: {d1 t3 t2 f1 t1 d2 }
+   */
+  def starJoinFilter(
+  outer: Set[Int],
+  inner: Set[Int],
+  filters: JoinGraphInfo) : Boolean = {
+val starJoins = filters.starJoins
+val nonStarJoins = filters.nonStarJoins
+val join = outer.union(inner)
+
+// Disjoint sets
+outer.intersect(inner).isEmpty &&
+  // Either star or non-star is empty
+  (starJoins.isEmpty || nonStarJoins.isEmpty ||
+// Join is a subset of the star-join
+join.subsetOf(starJoins) ||
+// Star-join is a subset of join
+starJoins.subsetOf(join) ||
--- End diff --

yes





[GitHub] spark issue #17555: [SPARK-19495][SQL] Make SQLConf slightly more extensible...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17555
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75586/
Test PASSed.





[GitHub] spark issue #17555: [SPARK-19495][SQL] Make SQLConf slightly more extensible...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17555
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17555: [SPARK-19495][SQL] Make SQLConf slightly more extensible...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17555
  
**[Test build #75586 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75586/testReport)**
 for PR 17555 at commit 
[`6084d95`](https://github.com/apache/spark/commit/6084d9507a19ddf4e4521bd28e1f96886d3a252e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17495: [SPARK-20172][Core] Add file permission check when listi...

2017-04-06 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/17495
  
Ping @vanzin @tgravescs again. Sorry to bother you, and I really appreciate
your time.





[GitHub] spark issue #14617: [SPARK-17019][Core] Expose on-heap and off-heap memory u...

2017-04-06 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/14617
  
I see. The current code leverages the `SparkListenerBlockUpdated` event to
calculate memory usage. Let me investigate the feasibility of using
`taskEnd.taskMetrics.updatedBlocks` instead for the calculation.
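
A rough sketch of the approach under discussion, with two stated assumptions:
depending on the Spark version the metrics field is named
`updatedBlockStatuses` rather than `updatedBlocks`, and a real implementation
would track state per BlockId instead of keeping a raw running sum:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class BlockMemoryListener extends SparkListener {
  @volatile private var storageMemorySeen = 0L

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      // Each entry is (BlockId, BlockStatus); memSize is the in-memory footprint
      // reported for blocks this task updated.
      metrics.updatedBlockStatuses.foreach { case (_, status) =>
        storageMemorySeen += status.memSize
      }
    }
  }
}
```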





[GitHub] spark issue #14617: [SPARK-17019][Core] Expose on-heap and off-heap memory u...

2017-04-06 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/14617
  
yeah, we definitely don't want to start logging more events.  But it seems 
like this info is already available -- taskEnd.taskMetrics.updatedBlocks 
already has everything, doesn't it?





[GitHub] spark pull request #17534: [SPARK-20218]'/applications/[app-id]/stages' in R...

2017-04-06 Thread guoxiaolongzte
Github user guoxiaolongzte commented on a diff in the pull request:

https://github.com/apache/spark/pull/17534#discussion_r110303522
  
--- Diff: docs/monitoring.md ---
@@ -299,12 +299,12 @@ can be identified by their `[attempt-id]`. In the API 
listed below, when running
   
 /applications/[app-id]/stages
 A list of all stages for a given application.
+?status=[active|complete|pending|failed] list only 
stages in the state.
   
   
 /applications/[app-id]/stages/[stage-id]
 
   A list of all attempts for the given stage.
-  ?status=[active|complete|pending|failed] list only 
stages in the state.
--- End diff --

It is filtering stages.
I have manually tested this API locally and confirmed that it has no effect there.
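
For reference, a quick manual check of the documented filter. The host, port,
and application id below are placeholders, and the `/api/v1` prefix is assumed
to be the REST root described elsewhere in monitoring.md:

```scala
// Hypothetical manual check against a locally running application's UI.
val url = "http://localhost:4040/api/v1/applications/app-20170406123456-0001/stages?status=complete"
val json = scala.io.Source.fromURL(url).mkString
println(json)
```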





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110303409
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with 
Logging {
 case class Cost(card: BigInt, size: BigInt) {
   def +(other: Cost): Cost = Cost(this.card + other.card, this.size + 
other.size)
 }
+
+/**
+ * Implements optional filters to reduce the search space for join 
enumeration.
+ *
+ * 1) Star-join filters: Plan star-joins together since they are assumed
+ *to have an optimal execution based on their RI relationship.
+ * 2) Cartesian products: Defer their planning later in the graph to avoid
+ *large intermediate results (expanding joins, in general).
+ * 3) Composite inners: Don't generate "bushy tree" plans to avoid 
materializing
+ *   intermediate results.
+ *
+ * Filters (2) and (3) are not implemented.
+ */
+case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper {
+  /**
+   * Builds join graph information to be used by the filtering strategies.
+   * Currently, it builds the sets of star/non-star joins.
+   * It can be extended with the sets of connected/unconnected joins, which
+   * can be used to filter Cartesian products.
+   */
+  def buildJoinGraphInfo(
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = {
+
+// Compute the tables in a star-schema relationship.
+val starJoin = StarSchemaDetection(conf).findStarJoins(items, 
conditions.toSeq)
+val nonStarJoin = items.filterNot(starJoin.contains(_))
+
+if (starJoin.nonEmpty && nonStarJoin.nonEmpty) {
+  val (starInt, nonStarInt) = planIndex.collect {
+case (p, i) if starJoin.contains(p) =>
+  (Some(i), None)
+case (p, i) if nonStarJoin.contains(p) =>
+  (None, Some(i))
+case _ =>
+  (None, None)
+  }.unzip
+  Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet))
+} else {
+  // Nothing interesting to return.
+  None
+}
+  }
+
+  /**
+   * Applies star-join filter.
+   *
+   * Given the outer/inner and the star/non-star sets,
+   * the following plan combinations are allowed:
+   * 1) (outer U inner) is a subset of star-join
+   * 2) star-join is a subset of (outer U inner)
+   * 3) (outer U inner) is a subset of non star-join
+   *
+   * It assumes the sets are disjoint.
+   *
+   * Example query graph:
+   *
+   * t1   d1 - t2 - t3
+   *  \  /
+   *   f1
+   *   |
+   *   d2
+   *
+   * star: {d1, f1, d2}
+   * non-star: {t2, t1, t3}
+   *
+   * level 0: (f1 ), (d2 ), (t3 ), (d1 ), (t1 ), (t2 )
+   * level 1: {t3 t2 }, {f1 d2 }, {f1 d1 }
+   * level 2: {d2 f1 d1 }
+   * level 3: {t1 d1 f1 d2 }, {t2 d1 f1 d2 }
+   * level 4: {d1 t2 f1 t1 d2 }, {d1 t3 t2 f1 d2 }
+   * level 5: {d1 t3 t2 f1 t1 d2 }
+   */
+  def starJoinFilter(
+  outer: Set[Int],
+  inner: Set[Int],
+  filters: JoinGraphInfo) : Boolean = {
+val starJoins = filters.starJoins
+val nonStarJoins = filters.nonStarJoins
+val join = outer.union(inner)
+
+// Disjoint sets
+outer.intersect(inner).isEmpty &&
+  // Either star or non-star is empty
+  (starJoins.isEmpty || nonStarJoins.isEmpty ||
+// Join is a subset of the star-join
+join.subsetOf(starJoins) ||
+// Star-join is a subset of join
+starJoins.subsetOf(join) ||
--- End diff --

If so, with this added filter, `CostBasedJoinReorder` can also keep the
star-join plans together, right?
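
To make the allowed combinations concrete, a small illustration based on the
docstring's example graph. The index assignment and the `JoinGraphInfo` field
names are assumptions read off the quoted code, not the PR's exact API:

```scala
import org.apache.spark.sql.catalyst.optimizer.{JoinGraphInfo, JoinReorderDPFilters}
import org.apache.spark.sql.internal.SQLConf

// Hypothetical indexes for {f1, d2, t3, d1, t1, t2} = 0..5.
val graph = JoinGraphInfo(starJoins = Set(0, 1, 3), nonStarJoins = Set(2, 4, 5))
val dpFilters = JoinReorderDPFilters(new SQLConf)

dpFilters.starJoinFilter(Set(0), Set(3), graph)  // true: {f1, d1} stays inside the star
dpFilters.starJoinFilter(Set(2), Set(5), graph)  // true: {t3, t2} stays inside the non-star set
dpFilters.starJoinFilter(Set(0), Set(2), graph)  // false: mixes star and non-star before the star join is done
```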





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110302895
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with 
Logging {
 case class Cost(card: BigInt, size: BigInt) {
   def +(other: Cost): Cost = Cost(this.card + other.card, this.size + 
other.size)
 }
+
+/**
+ * Implements optional filters to reduce the search space for join 
enumeration.
+ *
+ * 1) Star-join filters: Plan star-joins together since they are assumed
+ *to have an optimal execution based on their RI relationship.
+ * 2) Cartesian products: Defer their planning later in the graph to avoid
+ *large intermediate results (expanding joins, in general).
+ * 3) Composite inners: Don't generate "bushy tree" plans to avoid 
materializing
+ *   intermediate results.
+ *
+ * Filters (2) and (3) are not implemented.
+ */
+case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper {
+  /**
+   * Builds join graph information to be used by the filtering strategies.
+   * Currently, it builds the sets of star/non-star joins.
+   * It can be extended with the sets of connected/unconnected joins, which
+   * can be used to filter Cartesian products.
+   */
+  def buildJoinGraphInfo(
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = {
+
+// Compute the tables in a star-schema relationship.
+val starJoin = StarSchemaDetection(conf).findStarJoins(items, 
conditions.toSeq)
+val nonStarJoin = items.filterNot(starJoin.contains(_))
+
+if (starJoin.nonEmpty && nonStarJoin.nonEmpty) {
+  val (starInt, nonStarInt) = planIndex.collect {
+case (p, i) if starJoin.contains(p) =>
+  (Some(i), None)
+case (p, i) if nonStarJoin.contains(p) =>
+  (None, Some(i))
+case _ =>
+  (None, None)
+  }.unzip
+  Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet))
+} else {
+  // Nothing interesting to return.
+  None
+}
+  }
+
+  /**
+   * Applies star-join filter.
+   *
+   * Given the outer/inner and the star/non-star sets,
+   * the following plan combinations are allowed:
+   * 1) (outer U inner) is a subset of star-join
+   * 2) star-join is a subset of (outer U inner)
+   * 3) (outer U inner) is a subset of non star-join
+   *
+   * It assumes the sets are disjoint.
+   *
+   * Example query graph:
+   *
+   * t1   d1 - t2 - t3
+   *  \  /
+   *   f1
+   *   |
+   *   d2
+   *
+   * star: {d1, f1, d2}
+   * non-star: {t2, t1, t3}
+   *
+   * level 0: (f1 ), (d2 ), (t3 ), (d1 ), (t1 ), (t2 )
+   * level 1: {t3 t2 }, {f1 d2 }, {f1 d1 }
+   * level 2: {d2 f1 d1 }
+   * level 3: {t1 d1 f1 d2 }, {t2 d1 f1 d2 }
+   * level 4: {d1 t2 f1 t1 d2 }, {d1 t3 t2 f1 d2 }
+   * level 5: {d1 t3 t2 f1 t1 d2 }
+   */
+  def starJoinFilter(
+  outer: Set[Int],
+  inner: Set[Int],
+  filters: JoinGraphInfo) : Boolean = {
+val starJoins = filters.starJoins
+val nonStarJoins = filters.nonStarJoins
+val join = outer.union(inner)
+
+// Disjoint sets
+outer.intersect(inner).isEmpty &&
+  // Either star or non-star is empty
+  (starJoins.isEmpty || nonStarJoins.isEmpty ||
+// Join is a subset of the star-join
+join.subsetOf(starJoins) ||
+// Star-join is a subset of join
+starJoins.subsetOf(join) ||
--- End diff --

> Doesn't this cost-based join reorder rule breaks the order created by 
ReorderJoin?

This is expected from cost-based reordering. `ReorderJoin` only puts
connected items together; the order among those items is not optimized.





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110300420
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with 
Logging {
 case class Cost(card: BigInt, size: BigInt) {
   def +(other: Cost): Cost = Cost(this.card + other.card, this.size + 
other.size)
 }
+
+/**
+ * Implements optional filters to reduce the search space for join 
enumeration.
+ *
+ * 1) Star-join filters: Plan star-joins together since they are assumed
+ *to have an optimal execution based on their RI relationship.
+ * 2) Cartesian products: Defer their planning later in the graph to avoid
+ *large intermediate results (expanding joins, in general).
+ * 3) Composite inners: Don't generate "bushy tree" plans to avoid 
materializing
+ *   intermediate results.
+ *
+ * Filters (2) and (3) are not implemented.
+ */
+case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper {
+  /**
+   * Builds join graph information to be used by the filtering strategies.
+   * Currently, it builds the sets of star/non-star joins.
+   * It can be extended with the sets of connected/unconnected joins, which
+   * can be used to filter Cartesian products.
+   */
+  def buildJoinGraphInfo(
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = {
+
+// Compute the tables in a star-schema relationship.
+val starJoin = StarSchemaDetection(conf).findStarJoins(items, 
conditions.toSeq)
+val nonStarJoin = items.filterNot(starJoin.contains(_))
+
+if (starJoin.nonEmpty && nonStarJoin.nonEmpty) {
+  val (starInt, nonStarInt) = planIndex.collect {
+case (p, i) if starJoin.contains(p) =>
+  (Some(i), None)
+case (p, i) if nonStarJoin.contains(p) =>
+  (None, Some(i))
+case _ =>
+  (None, None)
+  }.unzip
+  Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet))
+} else {
+  // Nothing interesting to return.
+  None
+}
+  }
+
+  /**
+   * Applies star-join filter.
+   *
+   * Given the outer/inner and the star/non-star sets,
+   * the following plan combinations are allowed:
+   * 1) (outer U inner) is a subset of star-join
+   * 2) star-join is a subset of (outer U inner)
+   * 3) (outer U inner) is a subset of non star-join
+   *
+   * It assumes the sets are disjoint.
+   *
+   * Example query graph:
+   *
+   * t1   d1 - t2 - t3
+   *  \  /
+   *   f1
+   *   |
+   *   d2
+   *
+   * star: {d1, f1, d2}
+   * non-star: {t2, t1, t3}
+   *
+   * level 0: (f1 ), (d2 ), (t3 ), (d1 ), (t1 ), (t2 )
+   * level 1: {t3 t2 }, {f1 d2 }, {f1 d1 }
+   * level 2: {d2 f1 d1 }
+   * level 3: {t1 d1 f1 d2 }, {t2 d1 f1 d2 }
+   * level 4: {d1 t2 f1 t1 d2 }, {d1 t3 t2 f1 d2 }
+   * level 5: {d1 t3 t2 f1 t1 d2 }
+   */
+  def starJoinFilter(
+  outer: Set[Int],
+  inner: Set[Int],
+  filters: JoinGraphInfo) : Boolean = {
+val starJoins = filters.starJoins
+val nonStarJoins = filters.nonStarJoins
+val join = outer.union(inner)
+
+// Disjoint sets
+outer.intersect(inner).isEmpty &&
+  // Either star or non-star is empty
+  (starJoins.isEmpty || nonStarJoins.isEmpty ||
+// Join is a subset of the star-join
+join.subsetOf(starJoins) ||
+// Star-join is a subset of join
+starJoins.subsetOf(join) ||
--- End diff --

`ReorderJoin` will reorder the star-join plans. Doesn't this cost-based 
join reorder rule break the order created by `ReorderJoin`? Here we only 
require that this rule does not mix star-join plans with non-star-join plans 
when reordering, but it can still change the order among the star-join plans.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14617: [SPARK-17019][Core] Expose on-heap and off-heap memory u...

2017-04-06 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/14617
  
Thanks @squito .

Regarding showing memory usage in the history server: my major concern is 
that putting so many block update events into the event log will significantly 
increase the file size and delay the replay, which is why the current code 
deliberately bypasses block update events. And IIUC the history server does 
not need to show how used memory changes over time; only the final memory 
usage before the application finished will be shown on the UI. So instead of 
recording and replaying all the block update events, recording just the final 
memory usage of each executor is enough.
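
Just to sketch the idea (this is not the actual change, and the class and 
method names here are placeholders), a listener could keep only the latest 
per-block memory size and fold it into one number per executor at the end, 
instead of logging every block update event:

    import scala.collection.mutable

    import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

    // Keep only the latest memory footprint per (executor, block); at application
    // end this can be collapsed into a single record per executor for the event log.
    class FinalExecutorMemoryTracker extends SparkListener {
      private val memPerBlock = mutable.Map[(String, String), Long]()

      override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
        val info = event.blockUpdatedInfo
        // Overwrite rather than append: only the most recent size matters here.
        memPerBlock((info.blockManagerId.executorId, info.blockId.name)) = info.memSize
      }

      // Final memory usage per executor.
      def finalUsage: Map[String, Long] =
        memPerBlock.groupBy(_._1._1).mapValues(_.values.sum).toMap
    }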


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17551: [SPARK-20242][Web UI] Add spark.ui.stopDelay

2017-04-06 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/17551
  
@barnardb IIRC, only in Spark standalone mode is the HistoryServer embedded 
into the Master process, for convenience. You can always start a standalone 
HistoryServer process. 

Also, `FsHistoryProvider` is not bound to HDFS; any Hadoop-compatible FS can 
be supported, like wasb, s3 and other object stores that have a Hadoop 
FS-compatible layer. I would think that even in your cluster environment 
(k8s) you probably have an object store. And at the very least you could 
implement a customized `ApplicationHistoryProvider`.
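
For example, something along these lines (the bucket name is a placeholder) 
already works against any Hadoop-compatible store; the history server side 
only needs `spark.history.fs.logDirectory` pointed at the same location:

    import org.apache.spark.SparkConf

    // Write event logs to an object store with a Hadoop FS layer instead of HDFS.
    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "s3a://my-bucket/spark-events")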


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17553: [SPARK-20026][Doc] Add Tweedie example for SparkR in pro...

2017-04-06 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/17553
  
Could you add [SPARKR] to the PR title, please?





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16648: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16648
  
Build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16648: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16648
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75585/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16648: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16648
  
**[Test build #75585 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75585/testReport)**
 for PR 16648 at commit 
[`320db91`](https://github.com/apache/spark/commit/320db918d8064069907483610e8389b4a4d706c5).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15009
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75584/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15009
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15009
  
**[Test build #75584 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75584/testReport)**
 for PR 15009 at commit 
[`0cfd4a7`](https://github.com/apache/spark/commit/0cfd4a7eb540e751a03b9d8e78af4e8f6e3be62c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17555: [SPARK-19495][SQL] Make SQLConf slightly more extensible...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17555
  
**[Test build #75586 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75586/testReport)**
 for PR 17555 at commit 
[`6084d95`](https://github.com/apache/spark/commit/6084d9507a19ddf4e4521bd28e1f96886d3a252e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17555: [SPARK-19495][SQL] Make SQLConf slightly more ext...

2017-04-06 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/17555

[SPARK-19495][SQL] Make SQLConf slightly more extensible - addendum

## What changes were proposed in this pull request?
This is a tiny addendum to SPARK-19495 to remove the private visibility for 
`copy`, which is the only package-private method in the entire file.

## How was this patch tested?
N/A - no semantic change.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-19495-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17555.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17555


commit 6084d9507a19ddf4e4521bd28e1f96886d3a252e
Author: Reynold Xin 
Date:   2017-04-06T23:59:15Z

[SPARK-19495][SQL] Make SQLConf slightly more extensible - addendum




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17554: [MINOR][DOCS] Fix typo in Hive Examples

2017-04-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17554


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17554: [MINOR][DOCS] Fix typo in Hive Examples

2017-04-06 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17554
  
Thanks - merging in master.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17552: [SPARK-20245][SQL][minor] pass output to LogicalRelation...

2017-04-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17552
  
LGTM except only one comment


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17552: [SPARK-20245][SQL][minor] pass output to LogicalR...

2017-04-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17552#discussion_r110287008
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala
 ---
@@ -18,39 +18,21 @@ package org.apache.spark.sql.execution.datasources
 
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
 import org.apache.spark.sql.catalyst.catalog.CatalogTable
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, 
AttributeReference}
+import org.apache.spark.sql.catalyst.expressions.{AttributeMap, 
AttributeReference}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, 
Statistics}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.sources.BaseRelation
 import org.apache.spark.util.Utils
 
 /**
  * Used to link a [[BaseRelation]] in to a logical query plan.
- *
- * Note that sometimes we need to use `LogicalRelation` to replace an 
existing leaf node without
- * changing the output attributes' IDs.  The `expectedOutputAttributes` 
parameter is used for
- * this purpose.  See https://issues.apache.org/jira/browse/SPARK-10741 
for more details.
  */
 case class LogicalRelation(
 relation: BaseRelation,
-expectedOutputAttributes: Option[Seq[Attribute]] = None,
-catalogTable: Option[CatalogTable] = None)
+output: Seq[AttributeReference],
+catalogTable: Option[CatalogTable])
   extends LeafNode with MultiInstanceRelation {
 
-  override val output: Seq[AttributeReference] = {
-val attrs = relation.schema.toAttributes
-expectedOutputAttributes.map { expectedAttrs =>
-  assert(expectedAttrs.length == attrs.length)
-  attrs.zip(expectedAttrs).map {
-// We should respect the attribute names provided by base relation 
and only use the
-// exprId in `expectedOutputAttributes`.
-// The reason is that, some relations(like parquet) will reconcile 
attribute names to
-// workaround case insensitivity issue.
-case (attr, expected) => attr.withExprId(expected.exprId)
--- End diff --

It sounds like the logic mentioned in the comments is removed by this PR. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-04-06 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/17092
  
Ping.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16966: [SPARK-18409][ML]LSH approxNearestNeighbors should use a...

2017-04-06 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16966
  
Ping.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17546
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17546
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75583/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17546
  
**[Test build #75583 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75583/testReport)**
 for PR 17546 at commit 
[`9e81154`](https://github.com/apache/spark/commit/9e81154f94441e78b4b3ac0cd20f53746276d030).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17554: [MINOR][DOCS] Fix typo in Hive Examples

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17554
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17554: [MINOR][DOCS] Fix typo in Hive Examples

2017-04-06 Thread cooper6581
GitHub user cooper6581 opened a pull request:

https://github.com/apache/spark/pull/17554

[MINOR][DOCS] Fix typo in Hive Examples

## What changes were proposed in this pull request?

Fix typo in Hive examples from "DaraFrames" to "DataFrames"

## How was this patch tested?

N/A

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cooper6581/spark typo-daraframes

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17554.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17554


commit a186830bf9c637c159c5673e52a89ac95f574eba
Author: Dustin Koupal 
Date:   2017-04-06T21:16:47Z

fix typo in hive examples




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16648: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16648
  
**[Test build #75585 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75585/testReport)**
 for PR 16648 at commit 
[`320db91`](https://github.com/apache/spark/commit/320db918d8064069907483610e8389b4a4d706c5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15009
  
**[Test build #75584 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75584/testReport)**
 for PR 15009 at commit 
[`0cfd4a7`](https://github.com/apache/spark/commit/0cfd4a7eb540e751a03b9d8e78af4e8f6e3be62c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17551: [SPARK-20242][Web UI] Add spark.ui.stopDelay

2017-04-06 Thread barnardb
Github user barnardb commented on the issue:

https://github.com/apache/spark/pull/17551
  
> It's still running your code, right? Why can't you add a configuration to 
your own code that tells it to wait some time before shutting down the 
SparkContext?

We're trying to support arbitrary jobs running on the cluster, to make it 
easy for users to inspect the jobs that they run there. This was a quick way 
to achieve that, but I agree with the other commenters that this is quite 
hacky, and that the history server would be a nicer solution. Our problem with 
the history server right now is that while the current driver-side 
`EventLoggingListener` + history-server-side `FsHistoryProvider` 
implementations are great for environments with HDFS, they're much less 
convenient in a cluster without a distributed filesystem. I'd propose that I 
close this PR and work on an RPC-based listener-provider combination to use 
with the history server.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...

2017-04-06 Thread ioana-delaney
Github user ioana-delaney commented on the issue:

https://github.com/apache/spark/pull/17546
  
@wzhfy Yes, star-schema detection is called from both `ReorderJoin` and 
`CostBasedJoinReorder`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17546
  
**[Test build #75583 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75583/testReport)**
 for PR 17546 at commit 
[`9e81154`](https://github.com/apache/spark/commit/9e81154f94441e78b4b3ac0cd20f53746276d030).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17541: [SPARK-20229][SQL] add semanticHash to QueryPlan

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17541
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75580/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17541: [SPARK-20229][SQL] add semanticHash to QueryPlan

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17541
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17541: [SPARK-20229][SQL] add semanticHash to QueryPlan

2017-04-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17541
  
**[Test build #75580 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75580/testReport)**
 for PR 17541 at commit 
[`99f8ad3`](https://github.com/apache/spark/commit/99f8ad3536daae74340fd6ae59236e291cfdeb84).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17553: [SPARK-20026][Doc] Add Tweedie example for SparkR in pro...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17553
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


