[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10577#discussion_r48820132

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
 }

 /**
+ * Combines all adjacent [[Union]] and [[Unions]] operators into a single [[Unions]].
+ */
+object CombineUnions extends Rule[LogicalPlan] {
+  private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = plan match {
+    case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r)
--- End diff --

Another option would just be to do this at construction time; that way we can avoid paying the cost in the analyzer. This would still limit the cases we could cache (i.e. we'd miss cached data unioned with other data), but that doesn't seem like a huge deal. I'd leave this rule here either way.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
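For readers following along, the recursion under discussion can be sketched outside Catalyst with a simplified plan ADT. Only the `Union`/`Unions`/`collectUnionChildren` names mirror the diff; everything else here (the `Plan` trait, `Leaf`, the example) is illustrative, not Spark's actual types:

```scala
// Minimal stand-ins for the Catalyst operators discussed above (illustrative only).
sealed trait Plan
case class Leaf(name: String) extends Plan
case class Union(left: Plan, right: Plan) extends Plan  // binary union
case class Unions(children: Seq[Plan]) extends Plan     // n-ary union

// Flatten any tree of adjacent binary Unions (and n-ary Unions) into one
// child list, mirroring what collectUnionChildren does in the PR.
def collectUnionChildren(plan: Plan): Seq[Plan] = plan match {
  case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r)
  case Unions(cs)  => cs.flatMap(collectUnionChildren)
  case other       => Seq(other)
}

// ((a UNION b) UNION c) collapses to a single Unions over all three leaves.
val plan = Union(Union(Leaf("a"), Leaf("b")), Leaf("c"))
val flat = Unions(collectUnionChildren(plan))
```

The construction-time alternative marmbrus raises would amount to `Union.apply` itself returning a flattened `Unions`, so the optimizer never sees a nested tree in the first place.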
[GitHub] spark pull request: [SPARK-7675][ML][PYSpark] sparkml params type ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9581#issuecomment-168930029

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48746/
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-168932871

**[Test build #48751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48751/consoleFull)** for PR 9858 at commit [`dd2bdc8`](https://github.com/apache/spark/commit/dd2bdc8650e9db763ec3afe290919d8a15404e9d).
[GitHub] spark pull request: [SPARK-12570] [ML] [Doc] DecisionTreeRegressor...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/10594

[SPARK-12570] [ML] [Doc] DecisionTreeRegressor: provide variance of prediction: user guide update

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-12570

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10594.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10594

commit 94273d1512eded0148d02b7a76925ee4a40d8039
Author: Yanbo Liang
Date: 2016-01-05T08:21:33Z

    DecisionTreeRegressor: provide variance of prediction: user guide update
[GitHub] spark pull request: [SPARK-12573][SPARK-12574][SQL] Move SQL Parse...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10583#issuecomment-168932606

**[Test build #48747 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48747/consoleFull)** for PR 10583 at commit [`fb3b4a4`](https://github.com/apache/spark/commit/fb3b4a4c461391866bc12a51dd1e60eadeaff916).
[GitHub] spark pull request: [SPARK-12539][SQL] support writing bucketed ta...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10498#discussion_r48821710

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -240,6 +241,23 @@ private[hive] class HiveMetastoreCatalog(val client: ClientInterface, hive: Hive
       }
     }

+    if (userSpecifiedSchema.isDefined && bucketSpec.isDefined) {
+      val BucketSpec(numBuckets, bucketColumns, sortColumns) = bucketSpec.get
+
+      tableProperties.put("spark.sql.sources.schema.numBuckets", numBuckets.toString)
+      tableProperties.put("spark.sql.sources.schema.numBucketCols", bucketColumns.length.toString)
+      bucketColumns.zipWithIndex.foreach { case (bucketCol, index) =>
+        tableProperties.put(s"spark.sql.sources.schema.bucketCol.$index", bucketCol)
+      }
+
+      if (sortColumns.isDefined) {
+        tableProperties.put("spark.sql.sources.schema.numSortCols", sortColumns.get.length.toString)
--- End diff --

Are we worried about the 4k limit, and as a result want to limit the size of each property?
[GitHub] spark pull request: [STREAMING][MINOR] More contextual information...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10595#issuecomment-168933998

Can you combine your pull requests into a single one?
[GitHub] spark pull request: [SPARK-12539][SQL] support writing bucketed ta...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10498#discussion_r48821401

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -240,6 +241,23 @@ private[hive] class HiveMetastoreCatalog(val client: ClientInterface, hive: Hive
       }
     }

+    if (userSpecifiedSchema.isDefined && bucketSpec.isDefined) {
+      val BucketSpec(numBuckets, bucketColumns, sortColumns) = bucketSpec.get
+
+      tableProperties.put("spark.sql.sources.schema.numBuckets", numBuckets.toString)
+      tableProperties.put("spark.sql.sources.schema.numBucketCols", bucketColumns.length.toString)
+      bucketColumns.zipWithIndex.foreach { case (bucketCol, index) =>
+        tableProperties.put(s"spark.sql.sources.schema.bucketCol.$index", bucketCol)
+      }
+
+      if (sortColumns.isDefined) {
+        tableProperties.put("spark.sql.sources.schema.numSortCols", sortColumns.get.length.toString)
--- End diff --

It's only used to read the sorting columns back, which is the same technique we used to store partitioned columns.
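The storage scheme being debated here (one numbered property per column, plus a count) can be sketched in isolation. The `spark.sql.sources.schema.*` key names for bucketing are taken from the diff; the `sortCol.$i` key, the `tableProperties` map, and `readBucketCols` are assumptions added for illustration:

```scala
import scala.collection.mutable

// Hypothetical stand-in for the metastore property map in the diff.
val tableProperties = mutable.Map.empty[String, String]

case class BucketSpec(
    numBuckets: Int,
    bucketColumns: Seq[String],
    sortColumns: Option[Seq[String]])

// Write each column name as its own numbered property so that no single value
// approaches the metastore's per-property size limit (the 4k concern above).
def putBucketSpec(spec: BucketSpec): Unit = {
  tableProperties.put("spark.sql.sources.schema.numBuckets", spec.numBuckets.toString)
  tableProperties.put("spark.sql.sources.schema.numBucketCols", spec.bucketColumns.length.toString)
  spec.bucketColumns.zipWithIndex.foreach { case (col, i) =>
    tableProperties.put(s"spark.sql.sources.schema.bucketCol.$i", col)
  }
  spec.sortColumns.foreach { cols =>
    tableProperties.put("spark.sql.sources.schema.numSortCols", cols.length.toString)
    cols.zipWithIndex.foreach { case (col, i) =>
      // Assumed key name, by analogy with bucketCol.$i; not shown in the diff.
      tableProperties.put(s"spark.sql.sources.schema.sortCol.$i", col)
    }
  }
}

// Reading back reverses the scheme: the count recovers the ordered list.
def readBucketCols(): Seq[String] = {
  val n = tableProperties("spark.sql.sources.schema.numBucketCols").toInt
  (0 until n).map(i => tableProperties(s"spark.sql.sources.schema.bucketCol.$i"))
}
```

This is the same count-plus-indexed-keys layout cloud-fan refers to for partitioned columns: each value stays small, and order is preserved by the index in the key.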
[GitHub] spark pull request: [STREAMING][MINOR] More contextual information...
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/10595

[STREAMING][MINOR] More contextual information in logs + minor code improvements

Please review and merge at your convenience. Thanks!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jaceklaskowski/spark streaming-minor-fixes

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10595.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10595

commit 62129336a171479b37edc347255a7be226fd2d22
Author: Jacek Laskowski
Date: 2016-01-05T08:25:00Z

    [STREAMING][MINOR] More contextual information in logs + minor code improvements
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-168934153

Fixed
[GitHub] spark pull request: [SPARK-12573][SPARK-12574][SQL] Move SQL Parse...
Github user hvanhovell commented on the pull request: https://github.com/apache/spark/pull/10583#issuecomment-168929936

retest this please
[GitHub] spark pull request: [SPARK-7675][ML][PYSpark] sparkml params type ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9581#issuecomment-168930028

Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7675][ML][PYSpark] sparkml params type ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9581#issuecomment-168929886

**[Test build #48746 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48746/consoleFull)** for PR 9581 at commit [`954f7c6`](https://github.com/apache/spark/commit/954f7c68f1e38aa80f33994f588c1bceb47679e2).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12539][SQL] support writing bucketed ta...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10498#discussion_r48820170

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -189,13 +220,43 @@ final class DataFrameWriter private[sql](df: DataFrame) {
         ifNotExists = false)).toRdd
   }

-  private def normalizedParCols: Option[Seq[String]] = partitioningColumns.map { parCols =>
-    parCols.map { col =>
-      df.logicalPlan.output
-        .map(_.name)
-        .find(df.sqlContext.analyzer.resolver(_, col))
-        .getOrElse(throw new AnalysisException(s"Partition column $col not found in existing " +
-          s"columns (${df.logicalPlan.output.map(_.name).mkString(", ")})"))
+  private def normalizedParCols: Option[Seq[String]] = partitioningColumns.map { cols =>
+    cols.map(normalize(_, "Partition"))
+  }
+
+  private def normalizedBucketCols: Option[Seq[String]] = bucketingColumns.map { cols =>
+    cols.map(normalize(_, "Bucketing"))
+  }
+
+  private def normalizedSortCols: Option[Seq[String]] = sortingColumns.map { cols =>
+    cols.map(normalize(_, "Sorting"))
+  }
+
+  private def getBucketSpec: Option[BucketSpec] = {
+    if (sortingColumns.isDefined) {
+      require(numBuckets.isDefined, "sortBy must be used together with bucketBy")
+    }
+
+    for {
+      n <- numBuckets
+      cols <- normalizedBucketCols
+    } yield {
+      require(n > 0, "Bucket number must be greater than 0.")
+      BucketSpec(n, cols, normalizedSortCols)
+    }
+  }
+
+  private def normalize(columnName: String, columnType: String): String = {
+    val validColumnNames = df.logicalPlan.output.map(_.name)
+    validColumnNames.find(df.sqlContext.analyzer.resolver(_, columnName))
+      .getOrElse(throw new AnalysisException(s"$columnType column $columnName not found in " +
+        s"existing columns (${validColumnNames.mkString(", ")})"))
+  }
+
+  private def assertNotBucketed(): Unit = {
+    if (numBuckets.isDefined || sortingColumns.isDefined) {
--- End diff --

I think sorting columns make no sense without bucketing columns, cc @nongli @yhuai
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168931946

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48742/
[GitHub] spark pull request: [SPARK-12567][SQL] Add aes_{encrypt,decrypt} U...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10527#issuecomment-168931880

**[Test build #48748 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48748/consoleFull)** for PR 10527 at commit [`0558bf8`](https://github.com/apache/spark/commit/0558bf8b698e9de7e19625627e487bfb3f33072d).
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168931761

**[Test build #48742 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48742/consoleFull)** for PR 10150 at commit [`0310efe`](https://github.com/apache/spark/commit/0310efeec1a202733b40a50085178ec1b1d97409).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12573][SPARK-12574][SQL] Move SQL Parse...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10583#issuecomment-168931844

**[Test build #2322 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2322/consoleFull)** for PR 10583 at commit [`fb3b4a4`](https://github.com/apache/spark/commit/fb3b4a4c461391866bc12a51dd1e60eadeaff916).
[GitHub] spark pull request: [SPARK-12581][SQL] Support case-sensitive tabl...
Github user maropu commented on the pull request: https://github.com/apache/spark/pull/10523#issuecomment-168932900

@liancheng @yhuai Could you review this?
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168931943

Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12640][SQL] Add simple benchmarking uti...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10589#issuecomment-168932322

**[Test build #48738 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48738/consoleFull)** for PR 10589 at commit [`22afd1f`](https://github.com/apache/spark/commit/22afd1f0115b86cdb5ba661dd2c0714ff6a4243b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class Benchmark(name: String, valuesPerIteration: Long, iters: Int = 5)`
  * `case class Case(name: String, fn: Int => Unit)`
  * `case class Result(avgMs: Double, avgRate: Double)`
[GitHub] spark pull request: [SPARK-12539][SQL] support writing bucketed ta...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10498#discussion_r48820931

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -189,13 +220,43 @@ final class DataFrameWriter private[sql](df: DataFrame) {
         ifNotExists = false)).toRdd
   }

-  private def normalizedParCols: Option[Seq[String]] = partitioningColumns.map { parCols =>
-    parCols.map { col =>
-      df.logicalPlan.output
-        .map(_.name)
-        .find(df.sqlContext.analyzer.resolver(_, col))
-        .getOrElse(throw new AnalysisException(s"Partition column $col not found in existing " +
-          s"columns (${df.logicalPlan.output.map(_.name).mkString(", ")})"))
+  private def normalizedParCols: Option[Seq[String]] = partitioningColumns.map { cols =>
+    cols.map(normalize(_, "Partition"))
+  }
+
+  private def normalizedBucketCols: Option[Seq[String]] = bucketingColumns.map { cols =>
+    cols.map(normalize(_, "Bucketing"))
+  }
+
+  private def normalizedSortCols: Option[Seq[String]] = sortingColumns.map { cols =>
+    cols.map(normalize(_, "Sorting"))
+  }
+
+  private def getBucketSpec: Option[BucketSpec] = {
+    if (sortingColumns.isDefined) {
+      require(numBuckets.isDefined, "sortBy must be used together with bucketBy")
+    }
+
+    for {
+      n <- numBuckets
+      cols <- normalizedBucketCols
+    } yield {
+      require(n > 0, "Bucket number must be greater than 0.")
+      BucketSpec(n, cols, normalizedSortCols)
+    }
+  }
+
+  private def normalize(columnName: String, columnType: String): String = {
+    val validColumnNames = df.logicalPlan.output.map(_.name)
+    validColumnNames.find(df.sqlContext.analyzer.resolver(_, columnName))
+      .getOrElse(throw new AnalysisException(s"$columnType column $columnName not found in " +
+        s"existing columns (${validColumnNames.mkString(", ")})"))
+  }
+
+  private def assertNotBucketed(): Unit = {
+    if (numBuckets.isDefined || sortingColumns.isDefined) {
--- End diff --

Your point makes sense if you look at it from the implementation's perspective, but if I'm a user, why do I have to call bucketBy in order to use sortBy?
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10577#discussion_r48820999

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
 }

 /**
+ * Combines all adjacent [[Union]] and [[Unions]] operators into a single [[Unions]].
+ */
+object CombineUnions extends Rule[LogicalPlan] {
+  private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = plan match {
+    case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r)
--- End diff --

To do this at construction time, would we need to introduce a new DataFrame API `unionAll` that can combine more than two DataFrames? @marmbrus @rxin Is my understanding correct? Thank you!
[GitHub] spark pull request: [SPARK-12393] [SparkR] Add read.text and write...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/10348#issuecomment-168933554

ping @shivaram @sun-rui @felixcheung
[GitHub] spark pull request: [SPARK-12539][SQL] support writing bucketed ta...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10498#discussion_r48820313

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -189,13 +220,43 @@ final class DataFrameWriter private[sql](df: DataFrame) {
         ifNotExists = false)).toRdd
   }

-  private def normalizedParCols: Option[Seq[String]] = partitioningColumns.map { parCols =>
-    parCols.map { col =>
-      df.logicalPlan.output
-        .map(_.name)
-        .find(df.sqlContext.analyzer.resolver(_, col))
-        .getOrElse(throw new AnalysisException(s"Partition column $col not found in existing " +
-          s"columns (${df.logicalPlan.output.map(_.name).mkString(", ")})"))
+  private def normalizedParCols: Option[Seq[String]] = partitioningColumns.map { cols =>
+    cols.map(normalize(_, "Partition"))
+  }
+
+  private def normalizedBucketCols: Option[Seq[String]] = bucketingColumns.map { cols =>
+    cols.map(normalize(_, "Bucketing"))
+  }
+
+  private def normalizedSortCols: Option[Seq[String]] = sortingColumns.map { cols =>
+    cols.map(normalize(_, "Sorting"))
+  }
+
+  private def getBucketSpec: Option[BucketSpec] = {
+    if (sortingColumns.isDefined) {
+      require(numBuckets.isDefined, "sortBy must be used together with bucketBy")
+    }
+
+    for {
+      n <- numBuckets
+      cols <- normalizedBucketCols
+    } yield {
+      require(n > 0, "Bucket number must be greater than 0.")
+      BucketSpec(n, cols, normalizedSortCols)
+    }
+  }
+
+  private def normalize(columnName: String, columnType: String): String = {
+    val validColumnNames = df.logicalPlan.output.map(_.name)
+    validColumnNames.find(df.sqlContext.analyzer.resolver(_, columnName))
+      .getOrElse(throw new AnalysisException(s"$columnType column $columnName not found in " +
+        s"existing columns (${validColumnNames.mkString(", ")})"))
+  }
+
+  private def assertNotBucketed(): Unit = {
+    if (numBuckets.isDefined || sortingColumns.isDefined) {
--- End diff --

Isn't it the same as our normal `DataFrame.sort`? It still increases the compression ratio for Parquet.
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10577#discussion_r48820341

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
 }

 /**
+ * Combines all adjacent [[Union]] and [[Unions]] operators into a single [[Unions]].
+ */
+object CombineUnions extends Rule[LogicalPlan] {
+  private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = plan match {
+    case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r)
--- End diff --

+1
[GitHub] spark pull request: [SPARK-12539][SQL] support writing bucketed ta...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10498#discussion_r48820679 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -189,13 +220,43 @@ final class DataFrameWriter private[sql](df: DataFrame) { ifNotExists = false)).toRdd } - private def normalizedParCols: Option[Seq[String]] = partitioningColumns.map { parCols => -parCols.map { col => - df.logicalPlan.output -.map(_.name) -.find(df.sqlContext.analyzer.resolver(_, col)) -.getOrElse(throw new AnalysisException(s"Partition column $col not found in existing " + - s"columns (${df.logicalPlan.output.map(_.name).mkString(", ")})")) + private def normalizedParCols: Option[Seq[String]] = partitioningColumns.map { cols => +cols.map(normalize(_, "Partition")) + } + + private def normalizedBucketCols: Option[Seq[String]] = bucketingColumns.map { cols => +cols.map(normalize(_, "Bucketing")) + } + + private def normalizedSortCols: Option[Seq[String]] = sortingColumns.map { cols => +cols.map(normalize(_, "Sorting")) + } + + private def getBucketSpec: Option[BucketSpec] = { +if (sortingColumns.isDefined) { + require(numBuckets.isDefined, "sortBy must be used together with bucketBy") +} + +for { + n <- numBuckets + cols <- normalizedBucketCols +} yield { + require(n > 0, "Bucket number must be greater than 0.") + BucketSpec(n, cols, normalizedSortCols) +} + } + + private def normalize(columnName: String, columnType: String): String = { +val validColumnNames = df.logicalPlan.output.map(_.name) +validColumnNames.find(df.sqlContext.analyzer.resolver(_, columnName)) + .getOrElse(throw new AnalysisException(s"$columnType column $columnName not found in " + +s"existing columns (${validColumnNames.mkString(", ")})")) + } + + private def assertNotBucketed(): Unit = { +if (numBuckets.isDefined || sortingColumns.isDefined) { --- End diff -- If users just wanna sort the data, they can call `DataFrame.sort` before write. 
In this context, the `sortingColumns` is part of the bucketing information and should be used together with `bucketingColumns`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
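The `normalize` helper being discussed can be illustrated with a standalone sketch. This is a hedged simplification: a plain case-insensitive resolver stands in for `df.sqlContext.analyzer.resolver`, `IllegalArgumentException` stands in for `AnalysisException`, and `NormalizeSketch` is a name invented for this example.

```scala
// Standalone sketch of column-name normalization: resolve a user-supplied
// column name (any case) against the plan's output columns and return the
// stored name, failing like the AnalysisException path if nothing matches.
object NormalizeSketch {
  // Simplified stand-in for the analyzer's resolver.
  private def resolver(a: String, b: String): Boolean = a.equalsIgnoreCase(b)

  def normalize(outputColumns: Seq[String], columnName: String, columnType: String): String =
    outputColumns.find(resolver(_, columnName)).getOrElse {
      throw new IllegalArgumentException(
        s"$columnType column $columnName not found in existing columns " +
          s"(${outputColumns.mkString(", ")})")
    }

  def main(args: Array[String]): Unit = {
    // "USERID" resolves to the stored name "userId"
    println(normalize(Seq("userId", "ts"), "USERID", "Bucketing"))
  }
}
```

Normalizing once, up front, means partition, bucketing, and sorting columns can all share the same error message and the same case-insensitive resolution.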
[GitHub] spark pull request: [SPARK-12480][follow-up] use a single column v...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10588#issuecomment-168931722 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48741/ Test PASSed.
[GitHub] spark pull request: [SPARK-12480][follow-up] use a single column v...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10588#issuecomment-168931720 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12480][follow-up] use a single column v...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10588#issuecomment-168931526 **[Test build #48741 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48741/consoleFull)** for PR 10588 at commit [`b652b45`](https://github.com/apache/spark/commit/b652b4548fb2b9270f7ebd11397fdbc09a89f583).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...
GitHub user nongli opened a pull request: https://github.com/apache/spark/pull/10593 [SPARK-12644][SQL] Update parquet reader to be vectorized. This inlines a few of the Parquet decoders and adds vectorized APIs to support decoding in batch. There are a few particulars in the Parquet encodings that make this much more efficient. In particular, RLE encodings are very well suited for batch decoding. The Parquet 2.0 encodings are also well suited for this. This is a work in progress and does not affect the current execution. In subsequent patches, we will support more encodings and types before enabling this. Simple benchmarks indicate this can decode single ints more than 3x faster. You can merge this pull request into a Git repository by running: $ git pull https://github.com/nongli/spark spark-12644 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10593.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10593 commit 7eeff58298ceac076779a5cae05ca674ed0ac51a Author: Nong Date: 2015-12-31T22:45:30Z [SPARK-12636][SQL] Update UnsafeRowParquetRecordReader to support reading paths directly. As noted in the code, this change is to make this component easier to test in isolation. commit 22afd1f0115b86cdb5ba661dd2c0714ff6a4243b Author: Nong Date: 2016-01-01T00:26:34Z [SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks. We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this, and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do this. commit 3e41ed43ebc16f4ea0f2a642dbf3a5e40a8bd0d9 Author: Nong Date: 2016-01-01T05:12:44Z [SPARK-12635][SQL] Add ColumnarBatch, an in memory columnar format for execution.
There are many potential benefits of having an efficient in-memory columnar format as an alternative to UnsafeRow. This patch introduces ColumnarBatch/ColumnarVector, which starts this effort. The remaining implementation can be done as follow-up patches. As stated in the JIRA, there are useful external components that operate on memory in a simple columnar format. ColumnarBatch would serve that purpose and could serve as a zero-serialization/zero-copy exchange for this use case. This patch supports running the underlying data either on heap or off heap. On heap runs a bit faster, but we would need off-heap for zero-copy exchanges. Currently, this mode is hidden behind one interface (ColumnVector). This differs from Parquet or the existing columnar cache because this is *not* intended to be used as a storage format. The focus is entirely on CPU efficiency, as we expect to only have one of these batches in memory per task. commit d99659d89a7709df8223ab86b1edd244b1e63086 Author: Nong Date: 2016-01-01T07:28:06Z [SPARK-12644][SQL] Update parquet reader to be vectorized. This inlines a few of the Parquet decoders and adds vectorized APIs to support decoding in batch. There are a few particulars in the Parquet encodings that make this much more efficient. In particular, RLE encodings are very well suited for batch decoding. The Parquet 2.0 encodings are also well suited for this. This is a work in progress and does not affect the current execution. In subsequent patches, we will support more encodings and types before enabling this. Simple benchmarks indicate this can decode single ints more than 3x faster.
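To see why RLE-style encodings suit batch decoding, here is a toy sketch. It assumes a simplified run format of `(value, runLength)` pairs, not Parquet's actual RLE/bit-packed hybrid, and all names are illustrative: each run expands into the output array with one bulk fill rather than a per-value decode branch.

```scala
import java.util.Arrays

// Toy batch decoder for a run-length encoded column of ints.
object RleBatchSketch {
  // Expand each (value, runLength) run into `out`; returns values written.
  def decodeRuns(runs: Seq[(Int, Int)], out: Array[Int]): Int = {
    var pos = 0
    for ((value, runLength) <- runs) {
      Arrays.fill(out, pos, pos + runLength, value) // whole run in one call
      pos += runLength
    }
    pos
  }

  def main(args: Array[String]): Unit = {
    val out = new Array[Int](6)
    val n = decodeRuns(Seq((7, 4), (9, 2)), out)
    println(out.take(n).mkString(",")) // 7,7,7,7,9,9
  }
}
```

The bulk fill is what a vectorized API exposes to callers: the branch-per-value cost of a scalar decoder is amortized over an entire run.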
[GitHub] spark pull request: [SPARK-12640][SQL] Add simple benchmarking uti...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10589#issuecomment-168932593 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12640][SQL] Add simple benchmarking uti...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10589#issuecomment-168932596 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48738/ Test PASSed.
[GitHub] spark pull request: [SPARK-12480][follow-up] use a single column v...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10588#issuecomment-168933497 **[Test build #48750 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48750/consoleFull)** for PR 10588 at commit [`f3a557b`](https://github.com/apache/spark/commit/f3a557b5534c506e6987388a84ae4e561585d895).
[GitHub] spark pull request: [STREAMING][MINOR] More contextual information...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10595#issuecomment-16890 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10593#issuecomment-168934496 **[Test build #48749 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48749/consoleFull)** for PR 10593 at commit [`d99659d`](https://github.com/apache/spark/commit/d99659d89a7709df8223ab86b1edd244b1e63086).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public class VectorizedPlainValuesReader extends ValuesReader implements VectorizedValuesReader`
  * `public final class VectorizedRleValuesReader extends ValuesReader`
[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10593#issuecomment-168934501 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48749/ Test FAILed.
[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10593#issuecomment-168934499 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12641] Remove unused code related to Ha...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10590#issuecomment-168934843 **[Test build #48740 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48740/consoleFull)** for PR 10590 at commit [`4223cca`](https://github.com/apache/spark/commit/4223ccac9984d07aa858deb00caac4bba5ddc406).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12539][SQL] support writing bucketed ta...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10498#discussion_r48822540

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -189,13 +220,43 @@ final class DataFrameWriter private[sql](df: DataFrame) {
       ifNotExists = false)).toRdd
   }

-  private def normalizedParCols: Option[Seq[String]] = partitioningColumns.map { parCols =>
-    parCols.map { col =>
-      df.logicalPlan.output
-        .map(_.name)
-        .find(df.sqlContext.analyzer.resolver(_, col))
-        .getOrElse(throw new AnalysisException(s"Partition column $col not found in existing " +
-          s"columns (${df.logicalPlan.output.map(_.name).mkString(", ")})"))
+  private def normalizedParCols: Option[Seq[String]] = partitioningColumns.map { cols =>
+    cols.map(normalize(_, "Partition"))
+  }
+
+  private def normalizedBucketCols: Option[Seq[String]] = bucketingColumns.map { cols =>
+    cols.map(normalize(_, "Bucketing"))
+  }
+
+  private def normalizedSortCols: Option[Seq[String]] = sortingColumns.map { cols =>
+    cols.map(normalize(_, "Sorting"))
+  }
+
+  private def getBucketSpec: Option[BucketSpec] = {
+    if (sortingColumns.isDefined) {
+      require(numBuckets.isDefined, "sortBy must be used together with bucketBy")
+    }
+
+    for {
+      n <- numBuckets
+      cols <- normalizedBucketCols
+    } yield {
+      require(n > 0, "Bucket number must be greater than 0.")
+      BucketSpec(n, cols, normalizedSortCols)
+    }
+  }
+
+  private def normalize(columnName: String, columnType: String): String = {
+    val validColumnNames = df.logicalPlan.output.map(_.name)
+    validColumnNames.find(df.sqlContext.analyzer.resolver(_, columnName))
+      .getOrElse(throw new AnalysisException(s"$columnType column $columnName not found in " +
+        s"existing columns (${validColumnNames.mkString(", ")})"))
+  }
+
+  private def assertNotBucketed(): Unit = {
+    if (numBuckets.isDefined || sortingColumns.isDefined) {
--- End diff --

Maybe we need a better name than `sortBy` to indicate that users need to give the columns that will be used to sort the data within each bucket.
[GitHub] spark pull request: [SPARK-12570] [ML] [Doc] DecisionTreeRegressor...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10594#issuecomment-168935497 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12401][SQL] Add integration tests for p...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/10596 [SPARK-12401][SQL] Add integration tests for postgres enum types We can handle postgresql-specific enum types as strings in JDBC. So, we should just add tests and close the corresponding JIRA ticket. You can merge this pull request into a Git repository by running: $ git pull https://github.com/maropu/spark AddTestsInIntegration Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10596.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10596 commit 6460aca95eccd1249d03546b9deeb90f3f5f02e9 Author: Takeshi YAMAMURO Date: 2016-01-05T04:41:13Z Add tests for postgres enum types
[GitHub] spark pull request: [SPARK-12570] [ML] [Doc] DecisionTreeRegressor...
Github user BenFradet commented on a diff in the pull request: https://github.com/apache/spark/pull/10594#discussion_r48824613

--- Diff: docs/ml-classification-regression.md ---
@@ -535,7 +535,9 @@ The main differences between this API and the [original MLlib Decision Tree API]
 * use of DataFrame metadata to distinguish continuous and categorical features

-The Pipelines API for Decision Trees offers a bit more functionality than the original API. In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities).
+The Pipelines API for Decision Trees offers a bit more functionality than the original API.
+In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities);
--- End diff --

My bad, I didn't understand the sentence correctly.
[GitHub] spark pull request: [SPARK-12625][SPARKR][SQL] replace R usage of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10584#issuecomment-168940130 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48732/ Test FAILed.
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168943487 Todo:
- Will add the new `DataFrame` and `Dataset` APIs for `unionAll`, if my understanding is correct.
- Will add another rule for pushing `Filter` and `Project` through `Unions`.
Thanks!
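The flattening half of the rule under discussion can be sketched standalone. This is a hedged illustration with a toy plan ADT standing in for Catalyst's `LogicalPlan`: only the `Union`/`Unions` names and the `collectUnionChildren` recursion come from the PR, everything else is invented for the example.

```scala
// Toy logical-plan ADT: binary Union, n-ary Unions, and opaque leaf plans.
sealed trait Plan
case class Leaf(name: String) extends Plan
case class Union(left: Plan, right: Plan) extends Plan
case class Unions(children: Seq[Plan]) extends Plan

// Collapse any tree of adjacent Union/Unions nodes into a single Unions.
object CombineUnionsSketch {
  private def collectUnionChildren(plan: Plan): Seq[Plan] = plan match {
    case Union(l, r)      => collectUnionChildren(l) ++ collectUnionChildren(r)
    case Unions(children) => children.flatMap(collectUnionChildren)
    case other            => Seq(other)
  }

  def apply(plan: Plan): Plan = plan match {
    case u @ (_: Union | _: Unions) => Unions(collectUnionChildren(u))
    case other                      => other
  }
}
```

With a single n-ary `Unions` node in place, a follow-up rule that pushes `Filter` or `Project` through it only needs to map one transformation over `children` instead of recursing through a chain of binary nodes.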
[GitHub] spark pull request: [SPARK-11579] [ML] avoid creating new optimize...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9614#issuecomment-168946249 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11579] [ML] avoid creating new optimize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9614#issuecomment-168945836 **[Test build #48753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48753/consoleFull)** for PR 9614 at commit [`dcf0d8f`](https://github.com/apache/spark/commit/dcf0d8ff111cdb6812cb5ff74d0119331270b644).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12480][follow-up] use a single column v...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10588#issuecomment-168952375 **[Test build #48750 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48750/consoleFull)** for PR 10588 at commit [`f3a557b`](https://github.com/apache/spark/commit/f3a557b5534c506e6987388a84ae4e561585d895).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-168952176 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48751/ Test FAILed.
[GitHub] spark pull request: [SPARK-12573][SPARK-12574][SQL] Move SQL Parse...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10583#issuecomment-168955204 **[Test build #2322 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2322/consoleFull)** for PR 10583 at commit [`fb3b4a4`](https://github.com/apache/spark/commit/fb3b4a4c461391866bc12a51dd1e60eadeaff916).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [STREAMING][MINOR] Scaladoc fixes...mostly
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10592#issuecomment-168954471 LGTM
[GitHub] spark pull request: [SPARK-11624][SPARK-11972][SQL]fix commands th...
Github user adrian-wang commented on a diff in the pull request: https://github.com/apache/spark/pull/9589#discussion_r48829056

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala ---
@@ -151,29 +152,34 @@ private[hive] class ClientWrapper(
     // Switch to the initClassLoader.
     Thread.currentThread().setContextClassLoader(initClassLoader)
     val ret = try {
-      val initialConf = new HiveConf(classOf[SessionState])
-      // HiveConf is a Hadoop Configuration, which has a field of classLoader and
-      // the initial value will be the current thread's context class loader
-      // (i.e. initClassLoader at here).
-      // We call initialConf.setClassLoader(initClassLoader) at here to make
-      // this action explicit.
-      initialConf.setClassLoader(initClassLoader)
-      config.foreach { case (k, v) =>
-        if (k.toLowerCase.contains("password")) {
-          logDebug(s"Hive Config: $k=xxx")
-        } else {
-          logDebug(s"Hive Config: $k=$v")
+      val registeredState = SessionState.get
+      if (registeredState != null && registeredState.isInstanceOf[CliSessionState]) {
--- End diff --

When we have a `CliSessionState`, we are using the Spark SQL CLI, and in that case we never need a second `SessionState`. Creating another `SessionState` would fail in some cases: since `CliSessionState` inherits from `SessionState`, mixing the two could lead to a `ClassCastException`.
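The reuse logic described here boils down to the following sketch. These are toy classes, not the real Hive `SessionState` API, and `getOrCreate` is a name invented for the example: if a CLI-flavored state is already registered, hand it back instead of constructing a second state that CLI code would later fail to cast.

```scala
// Stand-ins for Hive's session classes: CliSessionState extends SessionState.
class SessionState
class CliSessionState extends SessionState

object SessionStateSketch {
  // Reuse an already-registered CLI state; otherwise create a plain one.
  def getOrCreate(registered: Option[SessionState]): SessionState = registered match {
    case Some(cli: CliSessionState) => cli           // Spark SQL CLI path: never build a second state
    case _                          => new SessionState
  }
}
```

Because the match returns the existing `CliSessionState` instance itself, any later downcast by CLI code still sees the subclass it expects.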
[GitHub] spark pull request: [SPARK-12453][Streaming] Remove explicit depen...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10492#issuecomment-168958835 Given the discussion here, I'm pretty confident in this change and would like to go ahead and merge it. It will also unblock further fixes in 12269.
[GitHub] spark pull request: [SPARK-12567][SQL] Add aes_{encrypt,decrypt} U...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10527#issuecomment-168963321 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12470] [SQL] Fix size reduction calcula...
Github user robbinspg commented on the pull request: https://github.com/apache/spark/pull/10421#issuecomment-168963270 I have a fix for the test failure. Should I create a new JIRA and PR?
[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-168963367 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-168963214 **[Test build #48754 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48754/consoleFull)** for PR 9451 at commit [`cd8ca34`](https://github.com/apache/spark/commit/cd8ca343019d1e7a2a43128ea070f9cda828dc81).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12567][SQL] Add aes_{encrypt,decrypt} U...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10527#issuecomment-168963322 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48748/ Test PASSed.
[GitHub] spark pull request: [STREAMING][MINOR] More contextual information...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/10595#discussion_r48830059 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobSet.scala --- @@ -59,17 +59,15 @@ case class JobSet( // Time taken to process all the jobs from the time they were submitted // (i.e. including the time they wait in the streaming scheduler queue) - def totalDelay: Long = { -processingEndTime - time.milliseconds - } + def totalDelay: Long = processingEndTime - time.milliseconds --- End diff -- Noted & thanks!
[GitHub] spark pull request: [SPARK-12401][SQL] Add integration tests for p...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10596#issuecomment-168967487 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12340][SQL]fix Int overflow in the Spar...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10562#issuecomment-168967549 @QiangCai I think the test failures are unrelated. However, before we can retest, you'll have to rebase, as there is now a merge conflict.
[GitHub] spark pull request: [SPARK-12401][SQL] Add integration tests for p...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10596#issuecomment-168967490 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48755/ Test PASSed.
[GitHub] spark pull request: [SPARK-12570] [ML] [Doc] DecisionTreeRegressor...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10594#issuecomment-168935501 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48752/ Test FAILed.
[GitHub] spark pull request: [SPARK-12641] Remove unused code related to Ha...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10590#issuecomment-168935787 **[Test build #2321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2321/consoleFull)** for PR 10590 at commit [`ffb9fb0`](https://github.com/apache/spark/commit/ffb9fb001b2fe848a7fb4ca4f250dbe206bae0e4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12570] [ML] [Doc] DecisionTreeRegressor...
Github user BenFradet commented on a diff in the pull request: https://github.com/apache/spark/pull/10594#discussion_r48822844 --- Diff: docs/ml-classification-regression.md --- @@ -535,7 +535,9 @@ The main differences between this API and the [original MLlib Decision Tree API] * use of DataFrame metadata to distinguish continuous and categorical features -The Pipelines API for Decision Trees offers a bit more functionality than the original API. In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities). +The Pipelines API for Decision Trees offers a bit more functionality than the original API. +In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities); --- End diff -- Line ends with ";".
[GitHub] spark pull request: [SPARK-10658][SPARK-11421][PYSPARK][CORE] Prov...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9313#issuecomment-168936205 **[Test build #48745 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48745/consoleFull)** for PR 9313 at commit [`bf3e98f`](https://github.com/apache/spark/commit/bf3e98f07097b21066fcd681c437998ce65a1379). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10658][SPARK-11421][PYSPARK][CORE] Prov...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9313#issuecomment-168936339 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...
Github user rahulpalamuttam commented on the pull request: https://github.com/apache/spark/pull/8960#issuecomment-168942081 @jkbradley @mengxr I have opened the new PR in Breeze here: https://github.com/scalanlp/breeze/pull/480
[GitHub] spark pull request: [SPARK-12645] [SparkR] SparkR support hash fun...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/10597 [SPARK-12645] [SparkR] SparkR support hash function Add ```hash``` function for SparkR ```DataFrame```. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-12645 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10597.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10597 commit c41eb1fd364c52d9eae0469229e0eb850c03c57a Author: Yanbo Liang Date: 2016-01-05T09:42:55Z SparkR support hash function
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-168952173 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12480][follow-up] use a single column v...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10588#issuecomment-168952844 Merged build finished. Test FAILed.
[GitHub] spark pull request: [STREAMING][MINOR] More contextual information...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10595#discussion_r48828208 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala --- @@ -286,7 +286,7 @@ abstract class DStream[T: ClassTag] ( dependencies.foreach(_.validateAtStart()) logInfo("Slide time = " + slideDuration) --- End diff -- It'd be fine to make this all use interpolation while you're at it.
[GitHub] spark pull request: [STREAMING][MINOR] More contextual information...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10595#discussion_r48828188 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobSet.scala --- @@ -59,17 +59,15 @@ case class JobSet( // Time taken to process all the jobs from the time they were submitted // (i.e. including the time they wait in the streaming scheduler queue) - def totalDelay: Long = { -processingEndTime - time.milliseconds - } + def totalDelay: Long = processingEndTime - time.milliseconds def toBatchInfo: BatchInfo = { BatchInfo( time, streamIdToInputInfo, submissionTime, - if (processingStartTime >= 0) Some(processingStartTime) else None, - if (processingEndTime >= 0) Some(processingEndTime) else None, + if (hasStarted) Some(processingStartTime) else None, --- End diff -- These change the logic slightly -- are you sure it's equivalent?
[GitHub] spark pull request: [STREAMING][MINOR] More contextual information...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10595#discussion_r48828135 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobSet.scala --- @@ -59,17 +59,15 @@ case class JobSet( // Time taken to process all the jobs from the time they were submitted // (i.e. including the time they wait in the streaming scheduler queue) - def totalDelay: Long = { -processingEndTime - time.milliseconds - } + def totalDelay: Long = processingEndTime - time.milliseconds --- End diff -- Although I wouldn't bother with this kind of change, it's OK here IMHO.
[GitHub] spark pull request: [SPARK-12638] [API DOC] Parameter explaination...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10587#issuecomment-168955492 I like this, but how about adding similar docs to other similar methods like treeAggregate, fold, etc.? The semantics of fold were brought up just last week, for example. Your point about it being per-partition is quite pertinent there too.
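The per-partition point srowen raises is the classic gotcha with `RDD.fold`: the zero value is folded into every partition and then once more when combining partition results. A plain-Scala sketch (no Spark dependency; `rddStyleFold` and the Seq-of-Seqs standing in for an RDD's partitions are both hypothetical illustrations, not Spark API):

```scala
// Simulates how RDD.fold applies its zero value: once per partition,
// and once more when combining the per-partition results.
object FoldPerPartitionSketch {
  def rddStyleFold(partitions: Seq[Seq[Int]], zero: Int)(op: (Int, Int) => Int): Int = {
    val perPartition = partitions.map(_.foldLeft(zero)(op)) // zero folded into each partition
    perPartition.foldLeft(zero)(op)                         // zero folded in once more to combine
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2), Seq(3, 4)) // "RDD" with two partitions
    // Neutral zero (0 for +): result matches a plain sum, 10.
    println(rddStyleFold(partitions, 0)(_ + _))
    // Non-neutral zero (1): counted numPartitions + 1 times, so 10 + 3 = 13.
    println(rddStyleFold(partitions, 1)(_ + _))
  }
}
```

This is exactly why the docs need to say the zero must be the identity of `op`: a non-identity zero makes the result depend on the partition count.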
[GitHub] spark pull request: [CORE][MINOR] scaladoc fixes
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10591#issuecomment-168955273 **[Test build #2323 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2323/consoleFull)** for PR 10591 at commit [`a23cfcf`](https://github.com/apache/spark/commit/a23cfcf8375c132c8d79c3c0ead3d0c317966f16).
[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-168963368 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48754/ Test FAILed.
[GitHub] spark pull request: [STREAMING][MINOR] More contextual information...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/10595#discussion_r48830663 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobSet.scala --- @@ -59,17 +59,15 @@ case class JobSet( // Time taken to process all the jobs from the time they were submitted // (i.e. including the time they wait in the streaming scheduler queue) - def totalDelay: Long = { -processingEndTime - time.milliseconds - } + def totalDelay: Long = processingEndTime - time.milliseconds def toBatchInfo: BatchInfo = { BatchInfo( time, streamIdToInputInfo, submissionTime, - if (processingStartTime >= 0) Some(processingStartTime) else None, - if (processingEndTime >= 0) Some(processingEndTime) else None, + if (hasStarted) Some(processingStartTime) else None, --- End diff -- Tested it locally (and can't wait to see the results from Jenkins). The current code assumes the times can be `0`, which they never can be. With `hasCompleted` it is also clearer that `processingEndTime` is already set at that point. The current version is over-complicated, IMHO.
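The refactor under discussion can be boiled down to a standalone sketch (field names borrowed from `JobSet`; the case class, defaults, and `Option`-returning helpers are simplified stand-ins, not the real Spark class). The `> 0` vs `>= 0` difference in the predicates is exactly the "slight logic change" srowen asks about; jaceklaskowski's point is that epoch-millisecond timestamps are never `0` in practice, so the two are equivalent:

```scala
// Sketch of replacing raw sentinel checks (-1 means "unset") with
// intention-revealing predicates hasStarted / hasCompleted.
case class JobSetSketch(processingStartTime: Long = -1L, processingEndTime: Long = -1L) {
  def hasStarted: Boolean = processingStartTime > 0    // was: processingStartTime >= 0
  def hasCompleted: Boolean = processingEndTime > 0    // was: processingEndTime >= 0

  // The toBatchInfo pattern from the diff: Option-wrap via the predicates.
  def startTimeOption: Option[Long] = if (hasStarted) Some(processingStartTime) else None
  def endTimeOption: Option[Long] = if (hasCompleted) Some(processingEndTime) else None
}
```

The predicates behave identically to the old inline checks for every realistic (non-zero) timestamp, while making the "not yet started/completed" states readable at the call site.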
[GitHub] spark pull request: [SPARK-12573][SPARK-12574][SQL] Move SQL Parse...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10583#issuecomment-168965051 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48747/ Test FAILed.
[GitHub] spark pull request: [SPARK-12573][SPARK-12574][SQL] Move SQL Parse...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10583#issuecomment-168965050 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12641] Remove unused code related to Ha...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10590#issuecomment-168935012 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11579] [ML] avoid creating new optimize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9614#issuecomment-168934988 **[Test build #48753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48753/consoleFull)** for PR 9614 at commit [`dcf0d8f`](https://github.com/apache/spark/commit/dcf0d8ff111cdb6812cb5ff74d0119331270b644).
[GitHub] spark pull request: [SPARK-12641] Remove unused code related to Ha...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10590#issuecomment-168935005 I've merged this. Thanks.
[GitHub] spark pull request: [SPARK-12641] Remove unused code related to Ha...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10590#issuecomment-168935015 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48740/ Test PASSed.
[GitHub] spark pull request: [SPARK-11579] [ML] avoid creating new optimize...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/9614#issuecomment-168935901 @avulanov Thanks for the review. The only remaining concern I have is that LBFGSOptimizer/SGDOptimizer appear to be getters, yet they are setters. I sent an update that changes the function names only.
[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-168937118 **[Test build #48754 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48754/consoleFull)** for PR 9451 at commit [`cd8ca34`](https://github.com/apache/spark/commit/cd8ca343019d1e7a2a43128ea070f9cda828dc81).
[GitHub] spark pull request: [SPARK-12393] [SparkR] Add read.text and write...
Github user sun-rui commented on the pull request: https://github.com/apache/spark/pull/10348#issuecomment-168938461 LGTM
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168943889 **[Test build #48756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48756/consoleFull)** for PR 10577 at commit [`c1f66f7`](https://github.com/apache/spark/commit/c1f66f744fce35eb657f9ec8a971dbd5449d0985).
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-168951738 **[Test build #48751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48751/consoleFull)** for PR 9858 at commit [`dd2bdc8`](https://github.com/apache/spark/commit/dd2bdc8650e9db763ec3afe290919d8a15404e9d). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12480][follow-up] use a single column v...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10588#issuecomment-168952849 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48750/ Test FAILed.
[GitHub] spark pull request: [STREAMING][MINOR] Scaladoc fixes...mostly
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10592#issuecomment-168955445 **[Test build #2324 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2324/consoleFull)** for PR 10592 at commit [`fa65c0d`](https://github.com/apache/spark/commit/fa65c0d69ca8ec97edb63353c34dfc5cdd04dacf).
[GitHub] spark pull request: [SPARK-12567][SQL] Add aes_{encrypt,decrypt} U...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10527#issuecomment-168963051 **[Test build #48748 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48748/consoleFull)** for PR 10527 at commit [`0558bf8`](https://github.com/apache/spark/commit/0558bf8b698e9de7e19625627e487bfb3f33072d). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class AesEncrypt(left: Expression, right: Expression)` * `case class AesDecrypt(left: Expression, right: Expression)`
[GitHub] spark pull request: [SPARK-12645] [SparkR] SparkR support hash fun...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10597#issuecomment-168963887 **[Test build #48757 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48757/consoleFull)** for PR 10597 at commit [`c41eb1f`](https://github.com/apache/spark/commit/c41eb1fd364c52d9eae0469229e0eb850c03c57a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12645] [SparkR] SparkR support hash fun...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10597#issuecomment-168963978 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48757/
[GitHub] spark pull request: [SPARK-12645] [SparkR] SparkR support hash fun...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10597#issuecomment-168963977 Merged build finished. Test PASSed.
[GitHub] spark pull request: [STREAMING][MINOR] More contextual information...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/10595#discussion_r48830750

--- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala ---
@@ -286,7 +286,7 @@ abstract class DStream[T: ClassTag] (
     dependencies.foreach(_.validateAtStart())
     logInfo("Slide time = " + slideDuration)
--- End diff --

Thanks! I had been thinking about it, but was hesitant to propose such a change since you might not like it :)
[GitHub] spark pull request: [SPARK-12573][SPARK-12574][SQL] Move SQL Parse...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10583#issuecomment-168964867

**[Test build #48747 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48747/consoleFull)** for PR 10583 at commit [`fb3b4a4`](https://github.com/apache/spark/commit/fb3b4a4c461391866bc12a51dd1e60eadeaff916).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12401][SQL] Add integration tests for p...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10596#issuecomment-168967352

**[Test build #48755 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48755/consoleFull)** for PR 10596 at commit [`6460aca`](https://github.com/apache/spark/commit/6460aca95eccd1249d03546b9deeb90f3f5f02e9).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12641] Remove unused code related to Ha...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10590
[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10593#issuecomment-168934389

**[Test build #48749 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48749/consoleFull)** for PR 10593 at commit [`d99659d`](https://github.com/apache/spark/commit/d99659d89a7709df8223ab86b1edd244b1e63086).