[GitHub] spark issue #17558: [SPARK-20247][CORE] Add jar but this jar is missing late...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/17558

@wangyum what if the task requires that jar? From your fix, what I understand is that you catch the exception and log a warning instead. But if a task does require the jar, does your fix suppress the exception, or just defer it to something like a `ClassNotFound` error at task runtime?
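To make the trade-off being questioned here concrete, below is a minimal illustrative sketch. Neither `fetchJar` nor `runTask` is a Spark API, and the exception types are assumptions; this is only meant to show "suppress at add time" versus "fail later at task time":

```scala
import java.io.{File, FileNotFoundException}

// Hypothetical stand-in for fetching a jar onto an executor.
def fetchJar(path: String): Unit =
  if (!new File(path).exists()) throw new FileNotFoundException(path)

// The behavior under review: downgrade a missing jar to a warning,
// so jobs that never touch the jar keep running.
def addJar(path: String): Unit =
  try fetchJar(path) catch {
    case e: FileNotFoundException => println(s"WARN: failed to add jar $path: $e")
  }

// A task that does need a class from that jar now fails later and less
// informatively, at task runtime instead of at addJar time.
def runTask(className: String): Unit = {
  Class.forName(className) // throws ClassNotFoundException if the jar never arrived
}

addJar("/tmp/missing.jar")       // only warns
// runTask("com.example.MyUdf")  // would throw ClassNotFoundException
```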
[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17527#discussion_r110317549

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---

```diff
@@ -328,7 +329,7 @@ object PartitioningUtils {
     } else {
       // TODO: Selective case sensitivity.
       val distinctPartColNames =
-        pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct
+        pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase(Locale.ROOT))).distinct
```

--- End diff --

I think this might cause a similar problem to https://github.com/apache/spark/pull/17527/files#r110317272.
[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17527#discussion_r110298557

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala ---

```diff
@@ -396,7 +397,7 @@ object PartitioningAwareFileIndex extends Logging {
       sessionOpt: Option[SparkSession]): Seq[FileStatus] = {
     logTrace(s"Listing $path")
     val fs = path.getFileSystem(hadoopConf)
-    val name = path.getName.toLowerCase
+    val name = path.getName.toLowerCase(Locale.ROOT)
```

--- End diff --

(This variable does not seem to be used.)
[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17527#discussion_r110317695

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala ---

```diff
@@ -222,7 +225,7 @@ case class PreprocessTableCreation(sparkSession: SparkSession) extends Rule[Logi
   val columnNames = if (sparkSession.sessionState.conf.caseSensitiveAnalysis) {
     schema.map(_.name)
   } else {
-    schema.map(_.name.toLowerCase)
+    schema.map(_.name.toLowerCase(Locale.ROOT))
```

--- End diff --

Maybe it is not ideal to point out every similar instance, but let me flag this one since the change is big. This may be a similar instance to https://github.com/apache/spark/pull/17527/files#r110317272.
[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17527#discussion_r110317441

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---

```diff
@@ -128,7 +128,8 @@ object PartitioningUtils {
     //   "hdfs://host:9000/invalidPath"
     //   "hdfs://host:9000/path"
     // TODO: Selective case sensitivity.
-    val discoveredBasePaths = optDiscoveredBasePaths.flatten.map(_.toString.toLowerCase())
+    val discoveredBasePaths =
+      optDiscoveredBasePaths.flatten.map(_.toString.toLowerCase(Locale.ROOT))
```

--- End diff --

I am worried about this one too. It sounds like the path could contain Turkish characters, I guess.
[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17527#discussion_r110314669

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringKeyHashMap.scala ---

```diff
@@ -25,7 +27,7 @@ object StringKeyHashMap {
   def apply[T](caseSensitive: Boolean): StringKeyHashMap[T] = if (caseSensitive) {
     new StringKeyHashMap[T](identity)
   } else {
-    new StringKeyHashMap[T](_.toLowerCase)
+    new StringKeyHashMap[T](_.toLowerCase(Locale.ROOT))
```

--- End diff --

This only seems to be used in `SimpleFunctionRegistry`. I don't think we have Turkish characters in function names, and I don't think users will use other languages in function names, so it is probably fine.
[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17527#discussion_r110315394

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala ---

```diff
@@ -82,8 +84,8 @@ case class OptimizeMetadataOnlyQuery(
   private def getPartitionAttrs(
       partitionColumnNames: Seq[String],
       relation: LogicalPlan): Seq[Attribute] = {
-    val partColumns = partitionColumnNames.map(_.toLowerCase).toSet
-    relation.output.filter(a => partColumns.contains(a.name.toLowerCase))
+    val partColumns = partitionColumnNames.map(_.toLowerCase(Locale.ROOT)).toSet
+    relation.output.filter(a => partColumns.contains(a.name.toLowerCase(Locale.ROOT)))
```

--- End diff --

I am a little bit worried about this change likewise. For example:

Before

```scala
scala> java.util.Locale.setDefault(new java.util.Locale("tr"))

scala> val partColumns = Seq("I").map(_.toLowerCase).toSet
partColumns: scala.collection.immutable.Set[String] = Set(ı)

scala> Seq("a", "ı", "I").filter(a => partColumns.contains(a.toLowerCase))
res13: Seq[String] = List(ı, I)
```

After

```scala
scala> val partColumns = Seq("I").map(_.toLowerCase(java.util.Locale.ROOT)).toSet
partColumns: scala.collection.immutable.Set[String] = Set(i)

scala> Seq("a", "ı", "I").filter(a => partColumns.contains(a.toLowerCase(java.util.Locale.ROOT)))
res14: Seq[String] = List(I)
```
[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17527#discussion_r110314541

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala ---

```diff
@@ -26,11 +28,12 @@ package org.apache.spark.sql.catalyst.util
 class CaseInsensitiveMap[T] private (val originalMap: Map[String, T]) extends Map[String, T]
   with Serializable {
 
-  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase))
+  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase(Locale.ROOT)))
```

--- End diff --

Maybe nitpicking, and it is rarely possible I guess, but to my knowledge this will affect the options users set via `spark.read.option(...)`. Namely, I think cases like the following are possible:

```scala
scala> java.util.Locale.setDefault(new java.util.Locale("tr"))

scala> val originalMap = Map("ı" -> 1, "I" -> 2)
originalMap: scala.collection.immutable.Map[String,Int] = Map(ı -> 1, I -> 2)
```

Before

```scala
scala> originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase))
res6: scala.collection.immutable.Map[String,Int] = Map(ı -> 2)
```

After

```scala
scala> originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase(java.util.Locale.ROOT)))
res7: scala.collection.immutable.Map[String,Int] = Map(ı -> 1, i -> 2)
```
[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17527#discussion_r110317272

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---

```diff
@@ -52,7 +54,11 @@ case class HadoopFsRelation(
 
   val schema: StructType = {
     val getColName: (StructField => String) =
-      if (sparkSession.sessionState.conf.caseSensitiveAnalysis) _.name else _.name.toLowerCase
+      if (sparkSession.sessionState.conf.caseSensitiveAnalysis) {
+        _.name
+      } else {
+        _.name.toLowerCase(Locale.ROOT)
+      }
```

--- End diff --

I think we should leave this one out. It seems `dataSchema` means the schema from the source, which is exposed to users, and I think this could cause a problem. For example:

Before

```scala
import collection.mutable
import org.apache.spark.sql.types._

java.util.Locale.setDefault(new java.util.Locale("tr"))

val partitionSchema: StructType = StructType(StructField("I", StringType) :: Nil)
val dataSchema: StructType = StructType(StructField("ı", StringType) :: Nil)
val getColName: (StructField => String) = _.name.toLowerCase

val overlappedPartCols = mutable.Map.empty[String, StructField]
partitionSchema.foreach { partitionField =>
  if (dataSchema.exists(getColName(_) == getColName(partitionField))) {
    overlappedPartCols += getColName(partitionField) -> partitionField
  }
}

val schema = StructType(dataSchema.map(f => overlappedPartCols.getOrElse(getColName(f), f)) ++
  partitionSchema.filterNot(f => overlappedPartCols.contains(getColName(f))))
schema.fieldNames
```

prints

```scala
Array[String] = Array(I)
```

After

```scala
import collection.mutable
import org.apache.spark.sql.types._

java.util.Locale.setDefault(new java.util.Locale("tr"))

val partitionSchema: StructType = StructType(StructField("I", StringType) :: Nil)
val dataSchema: StructType = StructType(StructField("ı", StringType) :: Nil)
val getColName: (StructField => String) = _.name.toLowerCase(java.util.Locale.ROOT)

val overlappedPartCols = mutable.Map.empty[String, StructField]
partitionSchema.foreach { partitionField =>
  if (dataSchema.exists(getColName(_) == getColName(partitionField))) {
    overlappedPartCols += getColName(partitionField) -> partitionField
  }
}

val schema = StructType(dataSchema.map(f => overlappedPartCols.getOrElse(getColName(f), f)) ++
  partitionSchema.filterNot(f => overlappedPartCols.contains(getColName(f))))
schema.fieldNames
```

prints

```scala
Array[String] = Array(ı, I)
```
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110318802

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---

```diff
@@ -134,7 +132,7 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr
  * For cost evaluation, since physical costs for operators are not available currently, we use
  * cardinalities and sizes to compute costs.
  */
-object JoinReorderDP extends PredicateHelper with Logging {
+case class JoinReorderDP(conf: SQLConf) extends PredicateHelper with Logging {
```

--- End diff --

@gatorsmile I would like to control the filters on top of the join enumeration. We might have other filters, e.g. left-deep trees only.
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110318621

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---

```diff
@@ -736,6 +736,12 @@ object SQLConf {
     .checkValue(weight => weight >= 0 && weight <= 1, "The weight value must be in [0, 1].")
     .createWithDefault(0.7)
 
+  val JOIN_REORDER_DP_STAR_FILTER =
+    buildConf("spark.sql.cbo.joinReorder.dp.star.filter")
+      .doc("Applies star-join filter heuristics to cost based join enumeration.")
+      .booleanConf
+      .createWithDefault(false)
```

--- End diff --

@gatorsmile Regardless of the default value, I still want to control the filters with their own knobs. The filters are applied on top of the join enumeration. They need to have their own control.
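As context for readers of the digest, toggling the proposed knob would look roughly like the snippet below. It assumes the conf name shown in the diff above plus Spark's existing CBO and join-reorder knobs, which gate this filter:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("star-filter-demo").getOrCreate()

// Join reordering is gated on cost-based optimization being enabled.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
// The knob added by this PR, off by default in the diff above.
spark.conf.set("spark.sql.cbo.joinReorder.dp.star.filter", "true")
```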
[GitHub] spark issue #17516: [SPARK-20197][SPARKR] CRAN check fail with package insta...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/17516

Don't we also need the skip-if-CRAN statement?
[GitHub] spark issue #17516: [SPARK-20197][SPARKR] CRAN check fail with package insta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17516

Merged build finished. Test PASSed.
[GitHub] spark issue #17516: [SPARK-20197][SPARKR] CRAN check fail with package insta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17516

Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75589/
Test PASSed.
[GitHub] spark issue #17516: [SPARK-20197][SPARKR] CRAN check fail with package insta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17516

**[Test build #75589 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75589/testReport)** for PR 17516 at commit [`a3e8b35`](https://github.com/apache/spark/commit/a3e8b350c6ff6aff3b1537de64bfeda602d8aa11).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17557: [SPARK-20208][WIP][R][DOCS] Document R fpGrowth support
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17557

Merged build finished. Test PASSed.
[GitHub] spark issue #17557: [SPARK-20208][WIP][R][DOCS] Document R fpGrowth support
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17557

Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75588/
Test PASSed.
[GitHub] spark issue #17557: [SPARK-20208][WIP][R][DOCS] Document R fpGrowth support
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17557

**[Test build #75588 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75588/testReport)** for PR 17557 at commit [`27e94fd`](https://github.com/apache/spark/commit/27e94fd6732edc50762cf6bc7e17e900ea1ff313).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110318101

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---

```diff
@@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with Logging {
 case class Cost(card: BigInt, size: BigInt) {
   def +(other: Cost): Cost = Cost(this.card + other.card, this.size + other.size)
 }
+
+/**
+ * Implements optional filters to reduce the search space for join enumeration.
+ *
+ * 1) Star-join filters: Plan star-joins together since they are assumed
+ *    to have an optimal execution based on their RI relationship.
+ * 2) Cartesian products: Defer their planning later in the graph to avoid
+ *    large intermediate results (expanding joins, in general).
+ * 3) Composite inners: Don't generate "bushy tree" plans to avoid materializing
+ *    intermediate results.
+ *
+ * Filters (2) and (3) are not implemented.
+ */
+case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper {
+  /**
+   * Builds join graph information to be used by the filtering strategies.
+   * Currently, it builds the sets of star/non-star joins.
+   * It can be extended with the sets of connected/unconnected joins, which
+   * can be used to filter Cartesian products.
+   */
+  def buildJoinGraphInfo(
+      items: Seq[LogicalPlan],
+      conditions: Set[Expression],
+      planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = {
+
+    // Compute the tables in a star-schema relationship.
+    val starJoin = StarSchemaDetection(conf).findStarJoins(items, conditions.toSeq)
+    val nonStarJoin = items.filterNot(starJoin.contains(_))
+
+    if (starJoin.nonEmpty && nonStarJoin.nonEmpty) {
+      val (starInt, nonStarInt) = planIndex.collect {
+        case (p, i) if starJoin.contains(p) =>
+          (Some(i), None)
+        case (p, i) if nonStarJoin.contains(p) =>
+          (None, Some(i))
+        case _ =>
+          (None, None)
+      }.unzip
+      Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet))
+    } else {
+      // Nothing interesting to return.
+      None
+    }
+  }
+
+  /**
+   * Applies star-join filter.
+   *
+   * Given the outer/inner and the star/non-star sets,
+   * the following plan combinations are allowed:
+   * 1) (outer U inner) is a subset of star-join
+   * 2) star-join is a subset of (outer U inner)
+   * 3) (outer U inner) is a subset of non star-join
+   *
+   * It assumes the sets are disjoint.
+   *
+   * Example query graph:
+   *
+   *   t1   d1 - t2 - t3
+   *    \  /
+   *     f1
+   *     |
+   *     d2
+   *
+   * star: {d1, f1, d2}
+   * non-star: {t2, t1, t3}
+   *
+   * level 0: (f1 ), (d2 ), (t3 ), (d1 ), (t1 ), (t2 )
+   * level 1: {t3 t2 }, {f1 d2 }, {f1 d1 }
+   * level 2: {d2 f1 d1 }
+   * level 3: {t1 d1 f1 d2 }, {t2 d1 f1 d2 }
+   * level 4: {d1 t2 f1 t1 d2 }, {d1 t3 t2 f1 d2 }
+   * level 5: {d1 t3 t2 f1 t1 d2 }
+   */
+  def starJoinFilter(
+      outer: Set[Int],
+      inner: Set[Int],
+      filters: JoinGraphInfo): Boolean = {
+    val starJoins = filters.starJoins
+    val nonStarJoins = filters.nonStarJoins
+    val join = outer.union(inner)
+
+    // Disjoint sets
+    outer.intersect(inner).isEmpty &&
+      // Either star or non-star is empty
+      (starJoins.isEmpty || nonStarJoins.isEmpty ||
+        // Join is a subset of the star-join
+        join.subsetOf(starJoins) ||
+        // Star-join is a subset of join
+        starJoins.subsetOf(join) ||
```

--- End diff --

@viirya The cost-based optimizer will find the best plan for the star-join. The star filter is a heuristic within join enumeration to limit the join sequences evaluated.
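To see the quoted predicate's set logic in isolation, here is a small self-contained rendering (including the doc comment's third condition, which falls just below the quoted cutoff), exercised against the example graph with hypothetical indices star {d1, f1, d2} = {0, 1, 2} and non-star {t1, t2, t3} = {3, 4, 5}:

```scala
// Standalone rendering of the starJoinFilter predicate quoted above,
// with the star/non-star sets passed in directly instead of via JoinGraphInfo.
def starJoinFilter(
    outer: Set[Int],
    inner: Set[Int],
    starJoins: Set[Int],
    nonStarJoins: Set[Int]): Boolean = {
  val join = outer.union(inner)
  outer.intersect(inner).isEmpty &&          // outer and inner must be disjoint
    (starJoins.isEmpty || nonStarJoins.isEmpty ||
      join.subsetOf(starJoins) ||            // (1) join stays inside the star
      starJoins.subsetOf(join) ||            // (2) join already covers the whole star
      join.subsetOf(nonStarJoins))           // (3) join stays inside the non-star set
}

val star    = Set(0, 1, 2)  // {d1, f1, d2}
val nonStar = Set(3, 4, 5)  // {t1, t2, t3}

starJoinFilter(Set(0), Set(1), star, nonStar)        // true: inside the star
starJoinFilter(Set(3), Set(4), star, nonStar)        // true: inside the non-star set
starJoinFilter(Set(0, 1), Set(3), star, nonStar)     // false: mixes star and non-star too early
starJoinFilter(Set(0, 1, 2), Set(3), star, nonStar)  // true: the star is fully joined first
```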
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17222

I'll try and follow up this weekend.
[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17494

Thanks @holdenk
[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17494

LGTM as well
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17222

**[Test build #75591 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75591/testReport)** for PR 17222 at commit [`4da2994`](https://github.com/apache/spark/commit/4da29941bdaef13fb94bd0d16e63cba8c8d197bc).
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222

@viirya Thanks for the careful review.
[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17494

Thanks @jkbradley
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17222

LGTM, see if @marmbrus or @holdenk have any more comments about this change.
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r110316824

--- Diff: python/pyspark/sql/tests.py ---

```diff
@@ -436,6 +436,20 @@ def test_udf_with_order_by_and_limit(self):
         res.explain(True)
         self.assertEqual(res.collect(), [Row(id=0, copy=0)])
 
+    def test_non_existed_udf(self):
+        try:
+            self.spark.udf.registerJavaFunction("udf1", "non_existed_udf")
+            self.fail("should fail due to can not load java udf class")
+        except py4j.protocol.Py4JError as e:
+            self.assertTrue("Can not load class non_existed_udf" in str(e))
+
+    def test_non_existed_udaf(self):
+        try:
+            self.spark.udf.registerJavaUDAF("udf1", "non_existed_udaf")
```

--- End diff --

nit: udf1 -> udaf1.
[GitHub] spark issue #17558: [SPARK-20247][CORE] Add jar but this jar is missing late...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17558

**[Test build #75590 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75590/testReport)** for PR 17558 at commit [`de5b5fe`](https://github.com/apache/spark/commit/de5b5fe5942bdea0fbd0a98ee11fcca035dccaf0).
[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17546

This looks pretty good overall.
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110316465

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---

```diff
@@ -736,6 +736,12 @@ object SQLConf {
     .checkValue(weight => weight >= 0 && weight <= 1, "The weight value must be in [0, 1].")
     .createWithDefault(0.7)
 
+  val JOIN_REORDER_DP_STAR_FILTER =
```

--- End diff --

So can we have this as `true` by default?
[GitHub] spark pull request #17558: [SPARK-20247][CORE] Add jar but this jar is missi...
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/17558

[SPARK-20247][CORE] Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar

## What changes were proposed in this pull request?

Catch the exception when the jar is missing, as [SPARK-20247](https://issues.apache.org/jira/browse/SPARK-20247) describes.

## How was this patch tested?

Unit tests and manual tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-20247

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17558.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17558

----

commit de5b5fe5942bdea0fbd0a98ee11fcca035dccaf0
Author: Yuming Wang
Date: 2017-04-07T04:51:01Z

    Catch exception when jar is missing.
[GitHub] spark issue #17516: [SPARK-20197][SPARKR] CRAN check fail with package insta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17516

**[Test build #75589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75589/testReport)** for PR 17516 at commit [`a3e8b35`](https://github.com/apache/spark/commit/a3e8b350c6ff6aff3b1537de64bfeda602d8aa11).
[GitHub] spark issue #17557: [SPARK-20208][WIP][R][DOCS] Document R fpGrowth support
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17557

**[Test build #75588 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75588/testReport)** for PR 17557 at commit [`27e94fd`](https://github.com/apache/spark/commit/27e94fd6732edc50762cf6bc7e17e900ea1ff313).
[GitHub] spark pull request #17557: [SPARK-20208][WIP][R][DOCS] Document R fpGrowth s...
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17557

[SPARK-20208][WIP][R][DOCS] Document R fpGrowth support

## What changes were proposed in this pull request?

Document fpGrowth in:

- vignettes
- programming guide
- code example

## How was this patch tested?

TODO

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-20208

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17557.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17557

----

commit 94d0cf2fcb3474b5c7217d85ebfe81819bd1dc9e
Author: zero323
Date: 2017-04-06T14:59:14Z

    List FP-growth among available algorithms

commit 27e94fd6732edc50762cf6bc7e17e900ea1ff313
Author: zero323
Date: 2017-04-06T15:38:18Z

    Add basic description
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15770

Any update on this?
[GitHub] spark pull request #17553: [SPARK-20026][Doc] Add Tweedie example for SparkR...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17553#discussion_r110315204

--- Diff: examples/src/main/r/ml/glm.R ---

```diff
@@ -56,6 +56,15 @@ summary(binomialGLM)
 # Prediction
 binomialPredictions <- predict(binomialGLM, binomialTestDF)
 head(binomialPredictions)
+
+# Fit a generalized linear model of family "tweedie" with spark.glm
+training3 <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
+tweedieDF <- transform(training3, label= training3$label * exp(randn(10)))
```

--- End diff --

nit, style: `label = trai...`
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110314839

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala ---

```scala
// @@ -0,0 +1,428 @@ (new file)
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.catalyst.optimizer

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap}
import org.apache.spark.sql.catalyst.plans.{Inner, PlanTest}
import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.RuleExecutor
import org.apache.spark.sql.catalyst.statsEstimation.{StatsEstimationTestBase, StatsTestPlan}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf._


class StarJoinCostBasedReorderSuite extends PlanTest with StatsEstimationTestBase {

  override val conf = new SQLConf().copy(
    CASE_SENSITIVE -> true,
    CBO_ENABLED -> true,
    JOIN_REORDER_ENABLED -> true,
    STARSCHEMA_DETECTION -> true,
    JOIN_REORDER_DP_STAR_FILTER -> true)

  object Optimize extends RuleExecutor[LogicalPlan] {
    val batches =
      Batch("Operator Optimizations", FixedPoint(100),
        CombineFilters,
        PushDownPredicate,
        ReorderJoin(conf),
        PushPredicateThroughJoin,
        ColumnPruning,
        CollapseProject) ::
      Batch("Join Reorder", Once,
        CostBasedJoinReorder(conf)) :: Nil
  }

  private val columnInfo: AttributeMap[ColumnStat] = AttributeMap(Seq(
    // F1 (fact table)
    attr("f1_fk1") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
      nullCount = 0, avgLen = 4, maxLen = 4),
    attr("f1_fk2") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
      nullCount = 0, avgLen = 4, maxLen = 4),
    attr("f1_fk3") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
      nullCount = 0, avgLen = 4, maxLen = 4),
    attr("f1_c1") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
      nullCount = 0, avgLen = 4, maxLen = 4),
    attr("f1_c2") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
      nullCount = 0, avgLen = 4, maxLen = 4),

    // D1 (dimension)
    attr("d1_pk") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100),
      nullCount = 0, avgLen = 4, maxLen = 4),
    attr("d1_c2") -> ColumnStat(distinctCount = 50, min = Some(1), max = Some(50),
      nullCount = 0, avgLen = 4, maxLen = 4),
    attr("d1_c3") -> ColumnStat(distinctCount = 50, min = Some(1), max = Some(50),
      nullCount = 0, avgLen = 4, maxLen = 4),

    // D2 (dimension)
    attr("d2_pk") -> ColumnStat(distinctCount = 20, min = Some(1), max = Some(20),
      nullCount = 0, avgLen = 4, maxLen = 4),
    attr("d2_c2") -> ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
      nullCount = 0, avgLen = 4, maxLen = 4),
    attr("d2_c3") -> ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
      nullCount = 0, avgLen = 4, maxLen = 4),

    // D3 (dimension)
    attr("d3_pk") -> ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
      nullCount = 0, avgLen = 4, maxLen = 4),
    attr("d3_c2") -> ColumnStat(distinctCount = 5, min = Some(1), max = Some(5),
      nullCount = 0, avgLen = 4, maxLen = 4),
    attr("d3_c3") -> ColumnStat(distinctCount = 5, min = Some(1), max = Some(5),
      nullCount = 0, avgLen = 4, maxLen = 4),

    // T1 (regular table i.e. outside star)
    attr("t1_c1") -> ColumnStat(distinctCount = 20, min = Some(1), max = Some(20),
      nullCount = 1, avgLen = 4, maxLen = 4),
```
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17556

Can one of the admins verify this patch?
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110314588

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala ---
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/17556

[SPARK-16957][MLlib] Use weighted midpoints for split values.

## What changes were proposed in this pull request?

Use weighted midpoints for split values.

## How was this patch tested?

+ [x] add unit test.
+ [x] modify Split's unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark ENH/decision_tree_overflow_and_precision_in_aggregation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17556.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17556

----

commit 45b74930eea787411855fc35a7ad7198b35d577e
Author: 颜发才（Yan Facai）
Date: 2017-04-07T04:02:13Z

    TST: add test case

commit c49d3ae7db0e66855b0c896375b11bf51d9ac482
Author: 颜发才（Yan Facai）
Date: 2017-04-07T04:05:36Z

    ENH: use weighted midpoints

commit 387eb498054289149706ecd2f88593d008fd074f
Author: 颜发才（Yan Facai）
Date: 2017-04-07T04:13:44Z

    BUG: constant feature, outOfIndex

commit 2e68f1efca59772d1e905474c2392ad0d8b413c8
Author: 颜发才（Yan Facai）
Date: 2017-04-07T04:15:09Z

    TST: modify split's test case

commit 6a5806f35185596ffda2c88c4879ecaf0be3bda1
Author: 颜发才（Yan Facai）
Date: 2017-04-07T04:24:02Z

    CLN: move test case
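For readers skimming the digest: "weighted midpoint" here presumably means placing a continuous split threshold between two adjacent distinct feature values, weighted by how many samples carry each value. A hedged sketch of one plausible formula follows; it is not necessarily the patch's exact code:

```scala
// One plausible count-weighted midpoint between two adjacent distinct
// feature values; a sketch, not necessarily what the patch implements.
def weightedMidpoint(leftValue: Double, leftCount: Long,
                     rightValue: Double, rightCount: Long): Double =
  (leftValue * leftCount + rightValue * rightCount) / (leftCount + rightCount)

weightedMidpoint(1.0, 5, 3.0, 5)  // 2.0: equal counts reduce to the plain midpoint
weightedMidpoint(1.0, 9, 3.0, 1)  // 1.2: the threshold shifts toward the denser value
```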
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110314369

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---

```diff
@@ -736,6 +736,12 @@ object SQLConf {
     .checkValue(weight => weight >= 0 && weight <= 1, "The weight value must be in [0, 1].")
     .createWithDefault(0.7)
 
+  val JOIN_REORDER_DP_STAR_FILTER =
```

--- End diff --

@viirya Star join plans are expected to have an optimal execution based on their referential integrity constraints among the tables. It is a good heuristic. I expect that once CBO is enabled by default, star joins will also be enabled.
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110313675

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---

```diff
@@ -54,14 +54,12 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr
 
   private def reorder(plan: LogicalPlan, output: Seq[Attribute]): LogicalPlan = {
     val (items, conditions) = extractInnerJoins(plan)
-    // TODO: Compute the set of star-joins and use them in the join enumeration
-    // algorithm to prune un-optimal plan choices.
     val result =
       // Do reordering if the number of items is appropriate and join conditions exist.
       // We also need to check if costs of all items can be evaluated.
       if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && conditions.nonEmpty &&
          items.forall(_.stats(conf).rowCount.isDefined)) {
-        JoinReorderDP.search(conf, items, conditions, output)
+        JoinReorderDP(conf).search(conf, items, conditions, output)
```

--- End diff --

Revert it back?
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110313661

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---

```diff
@@ -134,7 +132,7 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr
  * For cost evaluation, since physical costs for operators are not available currently, we use
  * cardinalities and sizes to compute costs.
  */
-object JoinReorderDP extends PredicateHelper with Logging {
+case class JoinReorderDP(conf: SQLConf) extends PredicateHelper with Logging {
```

--- End diff --

Revert it back?
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110313369

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---

```diff
@@ -736,6 +736,12 @@ object SQLConf {
     .checkValue(weight => weight >= 0 && weight <= 1, "The weight value must be in [0, 1].")
     .createWithDefault(0.7)
 
+  val JOIN_REORDER_DP_STAR_FILTER =
+    buildConf("spark.sql.cbo.joinReorder.dp.star.filter")
+      .doc("Applies star-join filter heuristics to cost based join enumeration.")
+      .booleanConf
+      .createWithDefault(false)
```

--- End diff --

cc @wzhfy @ron8hu @sameeragarwal @cloud-fan
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110313349 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -736,6 +736,12 @@ object SQLConf { .checkValue(weight => weight >= 0 && weight <= 1, "The weight value must be in [0, 1].") .createWithDefault(0.7) + val JOIN_REORDER_DP_STAR_FILTER = +buildConf("spark.sql.cbo.joinReorder.dp.star.filter") + .doc("Applies star-join filter heuristics to cost based join enumeration.") + .booleanConf + .createWithDefault(false) --- End diff -- The logic is enabled if and only if both `conf.cboEnabled` and `conf.joinReorderEnabled` are true. Thus, isn't it safe to make it `true` by default?
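A compact, plain-Scala sketch of the gating being described (the boolean fields are hypothetical stand-ins for the `SQLConf` getters): the star-join filter only runs on a path that is already behind the CBO and join-reorder flags, so its own default cannot affect users who have not opted in.

```scala
// Hypothetical stand-ins for conf.cboEnabled, conf.joinReorderEnabled and
// the new star-filter flag proposed in the diff.
final case class CboConf(
    cboEnabled: Boolean,
    joinReorderEnabled: Boolean,
    joinReorderDPStarFilter: Boolean)

object GatingDemo extends App {
  // The filter is reachable only when all three flags are true.
  def starFilterApplies(conf: CboConf): Boolean =
    conf.cboEnabled && conf.joinReorderEnabled && conf.joinReorderDPStarFilter

  // spark.sql.cbo.enabled and spark.sql.cbo.joinReorder.enabled both default
  // to false, so even with the new flag defaulted to true the filter stays
  // inert out of the box.
  assert(!starFilterApplies(CboConf(false, false, joinReorderDPStarFilter = true)))
  assert(starFilterApplies(CboConf(true, true, joinReorderDPStarFilter = true)))
}
```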
[GitHub] spark pull request #17552: [SPARK-20245][SQL][minor] pass output to LogicalR...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17552#discussion_r110312633 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala --- @@ -18,39 +18,21 @@ package org.apache.spark.sql.execution.datasources import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation import org.apache.spark.sql.catalyst.catalog.CatalogTable -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference} +import org.apache.spark.sql.catalyst.expressions.{AttributeMap, AttributeReference} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.sources.BaseRelation import org.apache.spark.util.Utils /** * Used to link a [[BaseRelation]] in to a logical query plan. - * - * Note that sometimes we need to use `LogicalRelation` to replace an existing leaf node without - * changing the output attributes' IDs. The `expectedOutputAttributes` parameter is used for - * this purpose. See https://issues.apache.org/jira/browse/SPARK-10741 for more details. */ case class LogicalRelation( relation: BaseRelation, -expectedOutputAttributes: Option[Seq[Attribute]] = None, -catalogTable: Option[CatalogTable] = None) +output: Seq[AttributeReference], +catalogTable: Option[CatalogTable]) extends LeafNode with MultiInstanceRelation { - override val output: Seq[AttributeReference] = { -val attrs = relation.schema.toAttributes -expectedOutputAttributes.map { expectedAttrs => - assert(expectedAttrs.length == attrs.length) - attrs.zip(expectedAttrs).map { -// We should respect the attribute names provided by base relation and only use the -// exprId in `expectedOutputAttributes`. -// The reason is that, some relations(like parquet) will reconcile attribute names to -// workaround case insensitivity issue. -case (attr, expected) => attr.withExprId(expected.exprId) --- End diff -- Agree.
[GitHub] spark issue #17552: [SPARK-20245][SQL][minor] pass output to LogicalRelation...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17552 LGTM pending Jenkins.
[GitHub] spark issue #17552: [SPARK-20245][SQL][minor] pass output to LogicalRelation...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17552 **[Test build #75587 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75587/testReport)** for PR 17552 at commit [`0fbd4a6`](https://github.com/apache/spark/commit/0fbd4a65f4c8242626fb35029cb22ce502dc696f).
[GitHub] spark pull request #17552: [SPARK-20245][SQL][minor] pass output to LogicalR...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17552#discussion_r110311641 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala (same hunk as quoted above) --- End diff -- good catch! I found this logic is only useful when converting hive tables to data source tables, so I moved the logic there: https://github.com/apache/spark/pull/17552/files#diff-ee66e11b56c21364760a5ed2b783f863R215
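The removed block above is the crux of the change: when a `LogicalRelation` replaces an existing leaf node, the relation's own attribute names must win (Parquet, for one, reconciles names to work around case sensitivity) while the pre-existing `exprId`s must survive so references elsewhere in the plan stay valid. A self-contained sketch of that reconciliation, with a simplified attribute type standing in for Catalyst's `AttributeReference`:

```scala
// Simplified stand-in for AttributeReference: just a name plus an exprId.
final case class Attr(name: String, exprId: Long) {
  def withExprId(id: Long): Attr = copy(exprId = id)
}

object ReconcileDemo extends App {
  // Keep the relation's names, reuse the expected exprIds (mirrors the
  // attrs.zip(expectedAttrs) mapping removed in the diff above).
  def reconcile(fresh: Seq[Attr], expected: Seq[Attr]): Seq[Attr] = {
    require(fresh.length == expected.length, "schemas must line up")
    fresh.zip(expected).map { case (attr, exp) => attr.withExprId(exp.exprId) }
  }

  // Parquet-style reconciled name kept, pre-existing exprId preserved.
  assert(reconcile(Seq(Attr("Col", 7L)), Seq(Attr("col", 3L))) == Seq(Attr("Col", 3L)))
}
```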
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110309359 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala --- @@ -0,0 +1,428 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.dsl.expressions._ +import org.apache.spark.sql.catalyst.dsl.plans._ +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap} +import org.apache.spark.sql.catalyst.plans.{Inner, PlanTest} +import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan} +import org.apache.spark.sql.catalyst.rules.RuleExecutor +import org.apache.spark.sql.catalyst.statsEstimation.{StatsEstimationTestBase, StatsTestPlan} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.internal.SQLConf._ + + +class StarJoinCostBasedReorderSuite extends PlanTest with StatsEstimationTestBase { + + override val conf = new SQLConf().copy( +CASE_SENSITIVE -> true, +CBO_ENABLED -> true, +JOIN_REORDER_ENABLED -> true, +STARSCHEMA_DETECTION -> true, +JOIN_REORDER_DP_STAR_FILTER -> true) + + object Optimize extends RuleExecutor[LogicalPlan] { +val batches = + Batch("Operator Optimizations", FixedPoint(100), +CombineFilters, +PushDownPredicate, +ReorderJoin(conf), +PushPredicateThroughJoin, +ColumnPruning, +CollapseProject) :: +Batch("Join Reorder", Once, + CostBasedJoinReorder(conf)) :: Nil + } + + private val columnInfo: AttributeMap[ColumnStat] = AttributeMap(Seq( +// F1 (fact table) +attr("f1_fk1") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100), + nullCount = 0, avgLen = 4, maxLen = 4), +attr("f1_fk2") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100), + nullCount = 0, avgLen = 4, maxLen = 4), +attr("f1_fk3") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100), + nullCount = 0, avgLen = 4, maxLen = 4), +attr("f1_c1") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100), + nullCount = 0, avgLen = 4, maxLen = 4), +attr("f1_c2") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100), + nullCount = 0, avgLen = 4, maxLen = 4), + +// D1 (dimension) +attr("d1_pk") -> ColumnStat(distinctCount = 100, min = Some(1), max = Some(100), + nullCount = 0, avgLen = 4, maxLen = 4), +attr("d1_c2") -> ColumnStat(distinctCount = 50, min = Some(1), max = Some(50), + nullCount = 0, avgLen = 4, maxLen = 4), +attr("d1_c3") -> ColumnStat(distinctCount = 50, min = Some(1), max = Some(50), + nullCount = 0, avgLen = 4, maxLen = 4), + +// D2 (dimension) +attr("d2_pk") -> ColumnStat(distinctCount = 20, min = Some(1), max = Some(20), + nullCount = 0, avgLen = 4, maxLen = 4), +attr("d2_c2") -> 
ColumnStat(distinctCount = 10, min = Some(1), max = Some(10), + nullCount = 0, avgLen = 4, maxLen = 4), +attr("d2_c3") -> ColumnStat(distinctCount = 10, min = Some(1), max = Some(10), + nullCount = 0, avgLen = 4, maxLen = 4), + +// D3 (dimension) +attr("d3_pk") -> ColumnStat(distinctCount = 10, min = Some(1), max = Some(10), + nullCount = 0, avgLen = 4, maxLen = 4), +attr("d3_c2") -> ColumnStat(distinctCount = 5, min = Some(1), max = Some(5), + nullCount = 0, avgLen = 4, maxLen = 4), +attr("d3_c3") -> ColumnStat(distinctCount = 5, min = Some(1), max = Some(5), + nullCount = 0, avgLen = 4, maxLen = 4), + +// T1 (regular table i.e. outside star) +attr("t1_c1") -> ColumnStat(distinctCount = 20, min = Some(1), max = Some(20), + nullCount = 1, avgLen = 4, maxLen = 4), +
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110309073 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala (same hunk as quoted above) ---
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110308327 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -736,6 +736,12 @@ object SQLConf { .checkValue(weight => weight >= 0 && weight <= 1, "The weight value must be in [0, 1].") .createWithDefault(0.7) + val JOIN_REORDER_DP_STAR_FILTER = --- End diff -- Are there any cases where we don't want to enable this if cbo is enabled?
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110307898 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala --- @@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with Logging { case class Cost(card: BigInt, size: BigInt) { def +(other: Cost): Cost = Cost(this.card + other.card, this.size + other.size) } + +/** + * Implements optional filters to reduce the search space for join enumeration. + * + * 1) Star-join filters: Plan star-joins together since they are assumed + *to have an optimal execution based on their RI relationship. + * 2) Cartesian products: Defer their planning later in the graph to avoid + *large intermediate results (expanding joins, in general). + * 3) Composite inners: Don't generate "bushy tree" plans to avoid materializing + * intermediate results. + * + * Filters (2) and (3) are not implemented. + */ +case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper { + /** + * Builds join graph information to be used by the filtering strategies. + * Currently, it builds the sets of star/non-star joins. + * It can be extended with the sets of connected/unconnected joins, which + * can be used to filter Cartesian products. + */ + def buildJoinGraphInfo( + items: Seq[LogicalPlan], + conditions: Set[Expression], + planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = { + +// Compute the tables in a star-schema relationship. +val starJoin = StarSchemaDetection(conf).findStarJoins(items, conditions.toSeq) +val nonStarJoin = items.filterNot(starJoin.contains(_)) + +if (starJoin.nonEmpty && nonStarJoin.nonEmpty) { + val (starInt, nonStarInt) = planIndex.collect { +case (p, i) if starJoin.contains(p) => + (Some(i), None) +case (p, i) if nonStarJoin.contains(p) => + (None, Some(i)) +case _ => + (None, None) + }.unzip + Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet)) +} else { + // Nothing interesting to return. + None +} + } + + /** + * Applies star-join filter. + * + * Given the outer/inner and the star/non-star sets, + * the following plan combinations are allowed: + * 1) (outer U inner) is a subset of star-join + * 2) star-join is a subset of (outer U inner) + * 3) (outer U inner) is a subset of non star-join + * + * It assumes the sets are disjoint. + * + * Example query graph: + * + * t1 d1 - t2 - t3 + * \ / + * f1 + * | + * d2 + * + * star: {d1, f1, d2} + * non-star: {t2, t1, t3} + * + * level 0: (f1 ), (d2 ), (t3 ), (d1 ), (t1 ), (t2 ) + * level 1: {t3 t2 }, {f1 d2 }, {f1 d1 } + * level 2: {d2 f1 d1 } + * level 3: {t1 d1 f1 d2 }, {t2 d1 f1 d2 } + * level 4: {d1 t2 f1 t1 d2 }, {d1 t3 t2 f1 d2 } + * level 5: {d1 t3 t2 f1 t1 d2 } + */ + def starJoinFilter( + outer: Set[Int], + inner: Set[Int], + filters: JoinGraphInfo) : Boolean = { +val starJoins = filters.starJoins +val nonStarJoins = filters.nonStarJoins +val join = outer.union(inner) + +// Disjoint sets +outer.intersect(inner).isEmpty && --- End diff -- ok for me.
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110307786 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala (same hunk as quoted above) --- End diff -- oh. right. forget that.
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110307666 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala (same hunk as quoted above) --- End diff -- `ReorderJoin` is done heuristically. It can be useful when cbo is off.
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110306486 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala (same hunk as quoted above) --- End diff -- So do we still need `ReorderJoin`? Looks like we don't need it anymore if we don't care about the order created by it.
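To make the allowed combinations concrete, here is a self-contained rendering of the subset checks from the quoted `starJoinFilter`, evaluated against the example sets in its doc comment (plain Scala; only the set logic is reproduced, not the conf plumbing or the DP search):

```scala
object StarFilterDemo extends App {
  // Sets from the example query graph, with tables encoded as indices:
  // d1=0, f1=1, d2=2, t1=3, t2=4, t3=5.
  val starJoins: Set[Int] = Set(0, 1, 2)
  val nonStarJoins: Set[Int] = Set(3, 4, 5)

  // The combination checks from starJoinFilter, minus the JoinGraphInfo wrapper.
  def starJoinFilter(outer: Set[Int], inner: Set[Int]): Boolean = {
    val join = outer.union(inner)
    outer.intersect(inner).isEmpty &&
      (starJoins.isEmpty || nonStarJoins.isEmpty ||
        join.subsetOf(starJoins) ||   // (1) combination stays inside the star
        starJoins.subsetOf(join) ||   // (2) combination covers the whole star
        join.subsetOf(nonStarJoins))  // (3) combination stays outside the star
  }

  assert(starJoinFilter(Set(1), Set(2)))      // {f1} x {d2}: inside the star
  assert(starJoinFilter(Set(5), Set(4)))      // {t3} x {t2}: outside the star
  assert(!starJoinFilter(Set(0, 1), Set(3)))  // partial star mixed with t1: pruned
}
```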
[GitHub] spark pull request #17555: [SPARK-19495][SQL] Make SQLConf slightly more ext...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17555
[GitHub] spark issue #17555: [SPARK-19495][SQL] Make SQLConf slightly more extensible...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17555 Thanks! Merging to master.
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110305903 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala (same hunk as quoted above) --- End diff -- yes
[GitHub] spark issue #17555: [SPARK-19495][SQL] Make SQLConf slightly more extensible...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17555 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75586/ Test PASSed.
[GitHub] spark issue #17555: [SPARK-19495][SQL] Make SQLConf slightly more extensible...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17555 Merged build finished. Test PASSed.
[GitHub] spark issue #17555: [SPARK-19495][SQL] Make SQLConf slightly more extensible...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17555 **[Test build #75586 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75586/testReport)** for PR 17555 at commit [`6084d95`](https://github.com/apache/spark/commit/6084d9507a19ddf4e4521bd28e1f96886d3a252e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17495: [SPARK-20172][Core] Add file permission check when listi...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/17495 Ping @vanzin @tgravescs again. Sorry to bother you and really appreciate your time.
[GitHub] spark issue #14617: [SPARK-17019][Core] Expose on-heap and off-heap memory u...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/14617 I see. The current code leverages the `SparkListenerBlockUpdated` event to calculate memory usage; let me investigate whether `taskEnd.taskMetrics.updatedBlocks` could be used for the calculation instead.
[GitHub] spark issue #14617: [SPARK-17019][Core] Expose on-heap and off-heap memory u...
Github user squito commented on the issue: https://github.com/apache/spark/pull/14617 yeah, we definitely don't want to start logging more events. But it seems like this info is already available -- taskEnd.taskMetrics.updatedBlocks already has everything, doesn't it?
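For reference, both signals in this thread flow through `SparkListener`. A minimal sketch of a listener that aggregates per-executor in-memory block sizes from block-update events (assuming the 2.x-era listener API, where `onBlockUpdated` and `BlockUpdatedInfo.memSize` exist; the exact shape of per-task updated-block metrics has varied between versions):

```scala
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

// Naive per-executor memory tally driven by block-update events. Repeated
// updates for the same block are summed here; real accounting would track
// per-block state, but the event wiring is what matters for this thread.
class BlockMemoryListener extends SparkListener {
  private val memByExecutor = mutable.Map.empty[String, Long].withDefaultValue(0L)

  override def onBlockUpdated(update: SparkListenerBlockUpdated): Unit = {
    val info = update.blockUpdatedInfo
    memByExecutor(info.blockManagerId.executorId) += info.memSize
  }

  def snapshot: Map[String, Long] = memByExecutor.toMap
}

// Registration (hypothetical context): sc.addSparkListener(new BlockMemoryListener)
```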
[GitHub] spark pull request #17534: [SPARK-20218]'/applications/[app-id]/stages' in R...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/17534#discussion_r110303522 --- Diff: docs/monitoring.md --- @@ -299,12 +299,12 @@ can be identified by their `[attempt-id]`. In the API listed below, when running /applications/[app-id]/stages A list of all stages for a given application. +?status=[active|complete|pending|failed] list only stages in the state. /applications/[app-id]/stages/[stage-id] A list of all attempts for the given stage. - ?status=[active|complete|pending|failed] list only stages in the state. --- End diff -- It filters stages. I have manually tested this API locally and confirmed that the parameter has no effect here.
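A quick way to see the documented behavior is to hit the REST API directly. A sketch against a locally running UI (the base URL and app id are placeholders):

```scala
import scala.io.Source

object StagesApiDemo extends App {
  // Placeholders: point these at a live Spark UI or history server.
  val base  = "http://localhost:4040/api/v1"
  val appId = "app-20170407120000-0000"

  // The status filter applies to the stage *list* resource...
  val active = Source.fromURL(s"$base/applications/$appId/stages?status=active").mkString

  // ...while the per-stage resource lists attempts for one stage and,
  // per the test above, ignores a ?status parameter.
  val attempts = Source.fromURL(s"$base/applications/$appId/stages/1").mkString

  println(active.take(200))
  println(attempts.take(200))
}
```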
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110303409 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala (same hunk as quoted above) --- End diff -- If so, with this added filter, `CostBasedJoinReorder` can also keep the star join plans together, right?
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110302895 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala (same hunk as quoted above) --- End diff -- > Doesn't this cost-based join reorder rule break the order created by ReorderJoin? This is expected from cost-based reordering. `ReorderJoin` only puts connected items together; the order among these items is not optimized.
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110300420 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala (same hunk as quoted above) --- End diff -- `ReorderJoin` will reorder the star join plans. Doesn't this cost-based join reorder rule break the order created by `ReorderJoin`? Here we only require that this rule doesn't reorder a mix of star join plans and non-star join plans, but it can still reorder the star join plans among themselves.
[GitHub] spark issue #14617: [SPARK-17019][Core] Expose on-heap and off-heap memory u...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/14617 Thanks @squito . Regarding showing memory usage in the history server: my major concern is that putting so many block update events into the event log will significantly increase the file size and delay the replay, which is why the current code deliberately bypasses block update events. And IIUC, in the history server it is not necessary to show the change of used memory over time; only the final memory usage before the application finished will be shown on the UI. So instead of recording and replaying all the block update events, just recording the final memory usage of each executor is enough.
[GitHub] spark issue #17551: [SPARK-20242][Web UI] Add spark.ui.stopDelay
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/17551

@barnardb IIRC, only in Spark standalone mode is the HistoryServer embedded into the Master process, for convenience. You can always start a standalone HistoryServer process. Also, `FsHistoryProvider` is not bound to HDFS; other Hadoop-compatible filesystems can be supported, like wasb, s3, and other object stores that have a Hadoop-compatible FS layer. I would think that even in your cluster environment (k8s) you probably have an object store. And at the least, you could implement a customized `ApplicationHistoryProvider`.
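For reference, the object-store route needs no custom code, only configuration. A sketch using standard Spark 2.x property names (the wasb container URL and the provider class name are made-up examples):

```properties
# Driver side (spark-defaults.conf): write event logs to a
# Hadoop-compatible object store.
spark.eventLog.enabled           true
spark.eventLog.dir               wasb://logs@example.blob.core.windows.net/spark-events

# History server side, started with sbin/start-history-server.sh:
spark.history.fs.logDirectory    wasb://logs@example.blob.core.windows.net/spark-events
# Or swap in a customized provider implementation instead:
# spark.history.provider         com.example.history.RpcHistoryProvider
```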
[GitHub] spark issue #17553: [SPARK-20026][Doc] Add Tweedie example for SparkR in pro...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17553 Could you add [SPARKR] to the PR title please?
[GitHub] spark issue #16648: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16648 Build finished. Test PASSed.
[GitHub] spark issue #16648: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16648 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75585/ Test PASSed.
[GitHub] spark issue #16648: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16648

**[Test build #75585 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75585/testReport)** for PR 16648 at commit [`320db91`](https://github.com/apache/spark/commit/320db918d8064069907483610e8389b4a4d706c5).
* This patch passes all tests.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15009 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75584/ Test PASSed.
[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15009 Merged build finished. Test PASSed.
[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15009

**[Test build #75584 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75584/testReport)** for PR 15009 at commit [`0cfd4a7`](https://github.com/apache/spark/commit/0cfd4a7eb540e751a03b9d8e78af4e8f6e3be62c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17555: [SPARK-19495][SQL] Make SQLConf slightly more extensible...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17555 **[Test build #75586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75586/testReport)** for PR 17555 at commit [`6084d95`](https://github.com/apache/spark/commit/6084d9507a19ddf4e4521bd28e1f96886d3a252e).
[GitHub] spark pull request #17555: [SPARK-19495][SQL] Make SQLConf slightly more ext...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/17555

[SPARK-19495][SQL] Make SQLConf slightly more extensible - addendum

## What changes were proposed in this pull request?

This is a tiny addendum to SPARK-19495 to remove the private visibility of `copy`, which is the only package-private method in the entire file.

## How was this patch tested?

N/A - no semantic change.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-19495-2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17555.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17555

commit 6084d9507a19ddf4e4521bd28e1f96886d3a252e
Author: Reynold Xin
Date: 2017-04-06T23:59:15Z

    [SPARK-19495][SQL] Make SQLConf slightly more extensible - addendum
[GitHub] spark pull request #17554: [MINOR][DOCS] Fix typo in Hive Examples
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17554
[GitHub] spark issue #17554: [MINOR][DOCS] Fix typo in Hive Examples
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17554 Thanks - merging in master.
[GitHub] spark issue #17552: [SPARK-20245][SQL][minor] pass output to LogicalRelation...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17552 LGTM except for one comment.
[GitHub] spark pull request #17552: [SPARK-20245][SQL][minor] pass output to LogicalR...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17552#discussion_r110287008

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---
@@ -18,39 +18,21 @@ package org.apache.spark.sql.execution.datasources

 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
 import org.apache.spark.sql.catalyst.catalog.CatalogTable
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference}
+import org.apache.spark.sql.catalyst.expressions.{AttributeMap, AttributeReference}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.sources.BaseRelation
 import org.apache.spark.util.Utils

 /**
  * Used to link a [[BaseRelation]] in to a logical query plan.
- *
- * Note that sometimes we need to use `LogicalRelation` to replace an existing leaf node without
- * changing the output attributes' IDs. The `expectedOutputAttributes` parameter is used for
- * this purpose. See https://issues.apache.org/jira/browse/SPARK-10741 for more details.
  */
 case class LogicalRelation(
     relation: BaseRelation,
-    expectedOutputAttributes: Option[Seq[Attribute]] = None,
-    catalogTable: Option[CatalogTable] = None)
+    output: Seq[AttributeReference],
+    catalogTable: Option[CatalogTable])
   extends LeafNode with MultiInstanceRelation {

-  override val output: Seq[AttributeReference] = {
-    val attrs = relation.schema.toAttributes
-    expectedOutputAttributes.map { expectedAttrs =>
-      assert(expectedAttrs.length == attrs.length)
-      attrs.zip(expectedAttrs).map {
-        // We should respect the attribute names provided by base relation and only use the
-        // exprId in `expectedOutputAttributes`.
-        // The reason is that, some relations(like parquet) will reconcile attribute names to
-        // workaround case insensitivity issue.
-        case (attr, expected) => attr.withExprId(expected.exprId)
--- End diff --

It sounds like the logic mentioned in the comments is removed by this PR.
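For context, the removed logic re-keys the relation's freshly created attributes with the exprIds that the surrounding plan already references, so that swapping in a rebuilt leaf node does not leave dangling attribute references. A stripped-down illustration with toy stand-ins (`Attr` here is not Spark's `Attribute`):

```scala
// Toy stand-in for Spark's Attribute/ExprId pair; not the real classes.
case class Attr(name: String, exprId: Long)

object ExprIdReconciliationSketch {
  // Keep the relation's own attribute names (e.g. Parquet may have
  // reconciled column-name case) but reuse the exprIds the rest of the
  // plan already references.
  def reconcile(fresh: Seq[Attr], expected: Seq[Attr]): Seq[Attr] = {
    require(fresh.length == expected.length)
    fresh.zip(expected).map { case (attr, exp) => attr.copy(exprId = exp.exprId) }
  }

  def main(args: Array[String]): Unit = {
    // The relation reports the reconciled name "id"; the old plan used exprId 7.
    println(reconcile(Seq(Attr("id", 99L)), Seq(Attr("ID", 7L))))
    // -> List(Attr(id,7))
  }
}
```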
[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/17092 Ping.
[GitHub] spark issue #16966: [SPARK-18409][ML]LSH approxNearestNeighbors should use a...
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/16966 Ping.
[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17546 Merged build finished. Test PASSed.
[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17546 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75583/ Test PASSed.
[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17546

**[Test build #75583 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75583/testReport)** for PR 17546 at commit [`9e81154`](https://github.com/apache/spark/commit/9e81154f94441e78b4b3ac0cd20f53746276d030).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17554: [MINOR][DOCS] Fix typo in Hive Examples
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17554 Can one of the admins verify this patch?
[GitHub] spark pull request #17554: [MINOR][DOCS] Fix typo in Hive Examples
GitHub user cooper6581 opened a pull request: https://github.com/apache/spark/pull/17554

[MINOR][DOCS] Fix typo in Hive Examples

## What changes were proposed in this pull request?

Fix typo in the Hive examples from "DaraFrames" to "DataFrames".

## How was this patch tested?

N/A

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cooper6581/spark typo-daraframes

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17554.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17554

commit a186830bf9c637c159c5673e52a89ac95f574eba
Author: Dustin Koupal
Date: 2017-04-06T21:16:47Z

    fix typo in hive examples
[GitHub] spark issue #16648: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16648 **[Test build #75585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75585/testReport)** for PR 16648 at commit [`320db91`](https://github.com/apache/spark/commit/320db918d8064069907483610e8389b4a4d706c5).
[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15009 **[Test build #75584 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75584/testReport)** for PR 15009 at commit [`0cfd4a7`](https://github.com/apache/spark/commit/0cfd4a7eb540e751a03b9d8e78af4e8f6e3be62c).
[GitHub] spark issue #17551: [SPARK-20242][Web UI] Add spark.ui.stopDelay
Github user barnardb commented on the issue: https://github.com/apache/spark/pull/17551

> It's still running your code, right? Why can't you add a configuration to your own code that tells it to wait some time before shutting down the SparkContext?

We're trying to support arbitrary jobs running on the cluster, to make it easy for users to inspect the jobs they run there. This was a quick way to achieve that, but I agree with the other commenters that it is quite hacky, and that the history server would be a nicer solution. Our problem with the history server right now is that while the current driver-side `EventLoggingListener` and history-server-side `FsHistoryProvider` implementations are great for environments with HDFS, they're much less convenient in a cluster without a distributed filesystem. I'd propose that I close this PR and work on an RPC-based listener-provider combination to use with the history server.
[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...
Github user ioana-delaney commented on the issue: https://github.com/apache/spark/pull/17546 @wzhfy Yes, star-schema detection is called from both `ReorderJoin` and `CostBasedJoinReorder`.
[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17546 **[Test build #75583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75583/testReport)** for PR 17546 at commit [`9e81154`](https://github.com/apache/spark/commit/9e81154f94441e78b4b3ac0cd20f53746276d030).
[GitHub] spark issue #17541: [SPARK-20229][SQL] add semanticHash to QueryPlan
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17541 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75580/ Test FAILed.
[GitHub] spark issue #17541: [SPARK-20229][SQL] add semanticHash to QueryPlan
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17541 Merged build finished. Test FAILed.
[GitHub] spark issue #17541: [SPARK-20229][SQL] add semanticHash to QueryPlan
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17541

**[Test build #75580 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75580/testReport)** for PR 17541 at commit [`99f8ad3`](https://github.com/apache/spark/commit/99f8ad3536daae74340fd6ae59236e291cfdeb84).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17553: [SPARK-20026][Doc] Add Tweedie example for SparkR in pro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17553 Merged build finished. Test PASSed.