[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1166 [SQL] Break hiveOperators.scala into multiple files. The single file was getting very long (500+ loc). You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark hiveOperators Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1166.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1166 commit 5b430689aff95482c97b88b860d0734de459038c Author: Reynold Xin r...@apache.org Date: 2014-06-21T06:26:18Z [SQL] Break hiveOperators.scala into multiple files. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1166#issuecomment-46746072 Merged build triggered.
[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1166#issuecomment-46746074 Merged build started.
[GitHub] spark pull request: [SPARK-2227] Support dfs command in SQL.
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1167 [SPARK-2227] Support dfs command in SQL. Note that nothing gets printed to the console because we don't properly maintain session state right now. I will have a followup PR that fixes it. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark commands Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1167.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1167 commit 56f04f8f0ab5c6949f2a4bf776b449dea5b368cf Author: Reynold Xin r...@apache.org Date: 2014-06-21T06:27:58Z [SPARK-2227] Support dfs command in SQL.
[GitHub] spark pull request: [SPARK-2227] Support dfs command in SQL.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1167#issuecomment-46746547 Merged build triggered.
[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1166#issuecomment-46747248 Merged build finished.
[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1166#issuecomment-46747249 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15982/
[GitHub] spark pull request: [SPARK-2227] Support dfs command in SQL.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1167#issuecomment-46747769 Merged build finished.
[GitHub] spark pull request: [SPARK-2227] Support dfs command in SQL.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1167#issuecomment-46747770 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15983/
[GitHub] spark pull request: SparkSQL add SkewJoin
Github user YanjieGao commented on the pull request: https://github.com/apache/spark/pull/1134#issuecomment-46754360 Hi rxin, I have reformatted it. Can you give me some suggestions? I will try to make it better.
[GitHub] spark pull request: Spark SQL basicOperators add Except operator
Github user YanjieGao commented on the pull request: https://github.com/apache/spark/pull/1151#issuecomment-46754425 Hi Zongheng, I tried it, adding code the way the other operators do. To add this Except operator, do I need to add or modify code in other Scala files? Thanks a lot
[GitHub] spark pull request: Spark SQL add LeftSemiBloomFilterBroadcastJoin
Github user YanjieGao commented on the pull request: https://github.com/apache/spark/pull/1127#issuecomment-46754487 Hi Zongheng, I have reformatted the code. I am not sure whether it is OK now, and I hope you can give me more suggestions. Thanks a lot
[GitHub] spark pull request: Branch 1.0 Add ZLIBCompressionCodec code
Github user YanjieGao commented on the pull request: https://github.com/apache/spark/pull/1115#issuecomment-46754558 Hi srowen, markhamstra. I want to merge this into the master branch. Last time I made a mistake, so I resubmitted this patch as https://github.com/apache/spark/pull/1121. I am not sure whether this is right; can you give me some other suggestions?
[GitHub] spark pull request: Branch 1.0 Add ZLIBCompressionCodec code
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1115#issuecomment-46755115 Assuming you have the `apache/spark` repository configured in your git repository as `upstream`, you can checkout your branch for this PR and `git pull --rebase upstream master`. This will try to apply your commits onto the latest code. You may have to resolve merge conflicts. Then `git push` your branch to update this PR. If it's getting confusing, you can start over. Update your copy of `master`, make a new branch, and `apply` or `cherry-pick` your commits.
[GitHub] spark pull request: Branch 1.0 Add ZLIBCompressionCodec code
Github user YanjieGao commented on the pull request: https://github.com/apache/spark/pull/1115#issuecomment-46755267 Thanks a lot, I will do it as you said. I originally forked the Spark repository on the web, wrote and ran the code in IntelliJ, then edited the Scala file and added the new code through the web page, committing it there. I am not sure whether updating code this way is right or not. Thanks!
[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...
Github user tmalaska commented on the pull request: https://github.com/apache/spark/pull/566#issuecomment-46755792 I'm going to have to make a new pull request, because I dropped the repo that belonged to this pull request. I will update the ticket with the information when it's ready.
[GitHub] spark pull request: Feat kryo max buffersize
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/735#issuecomment-46759314 Merged build triggered.
[GitHub] spark pull request: Feat kryo max buffersize
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/735#issuecomment-46759318 Merged build started.
[GitHub] spark pull request: Feat kryo max buffersize
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/735#issuecomment-46760188 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15984/
[GitHub] spark pull request: Feat kryo max buffersize
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/735#issuecomment-46760187 Merged build finished.
[GitHub] spark pull request: [SPARK-2227] Support dfs command in SQL.
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1167#issuecomment-46761293 We do have a circular buffer that holds the command output already. We could probably just add a command to clear it before each command and then optionally use it as the query result for these types of commands.
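The buffer-then-reuse idea marmbrus describes can be sketched with a minimal stand-in class (the class name, capacity, and API below are illustrative assumptions, not Spark's actual implementation):

```scala
import java.io.OutputStream

// Hypothetical sketch: a bounded output sink that keeps only the most recent
// `capacity` bytes, so long-running command output cannot exhaust memory.
class CircularByteBuffer(capacity: Int) extends OutputStream {
  private val buf = new Array[Byte](capacity)
  private var pos = 0          // next write position
  private var filled = false   // true once we have wrapped around

  override def write(b: Int): Unit = {
    buf(pos) = b.toByte
    pos = (pos + 1) % capacity
    if (pos == 0) filled = true
  }

  // Clear before each command, as suggested above.
  def reset(): Unit = { pos = 0; filled = false }

  // Buffered bytes in write order, oldest first.
  def contents: Array[Byte] =
    if (!filled) buf.take(pos)
    else buf.drop(pos) ++ buf.take(pos)
}
```

Calling `reset()` before each command and reading `contents` afterwards would let the captured text double as the query result for commands like `dfs`.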
[GitHub] spark pull request: [SPARK-2227] Support dfs command in SQL.
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1167#issuecomment-46761327 This is currently failing for unrelated GraphX MIMA issues.
```
[info] spark-graphx: found 0 potential binary incompatibilities (filtered 17)
[error] * method partitions()java.util.List in trait org.apache.spark.api.java.JavaRDDLike does not have a correspondent in old version
[error]   filter with: ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.api.java.JavaRDDLike.partitions")
```
test this please
[GitHub] spark pull request: Spark SQL add LeftSemiBloomFilterBroadcastJoin
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1127#discussion_r14051069 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala --- @@ -245,6 +245,73 @@ case class LeftSemiJoinBNL( } } + + + + +/** + * :: DeveloperApi :: + * LeftSemiBloomFilterBroadcastJoin + * Sometimes the semijoin's broadcast table can't fit memory.So we can make it as Bloomfilter to reduce the space + * and then broadcast it do the mapside join + * The bloomfilter use Shark's BloomFilter class implementation. + */ +@DeveloperApi +case class LeftSemiJoinBFB( +leftKeys: Seq[Expression], --- End diff -- Indent 4 spaces. Also I'd go with the full more descriptive name instead of BFB since we are only going to have to type it out in like 2 places.
[GitHub] spark pull request: Spark SQL basicOperators add Except operator
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1151#discussion_r14051137 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala --- @@ -204,3 +204,18 @@ case class ExistingRdd(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode { override def execute() = rdd } +/** + * :: DeveloperApi :: + * This operator support the substract function .Return an table with the elements from `this` that are not in `other`. --- End diff -- Limit lines to 100 chars.
[GitHub] spark pull request: Spark SQL basicOperators add Except operator
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1151#discussion_r14051147 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala --- @@ -204,3 +204,18 @@ case class ExistingRdd(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode { override def execute() = rdd } +/** + * :: DeveloperApi :: + * This operator support the substract function .Return an table with the elements from `this` that are not in `other`. + */ +@DeveloperApi +case class Except(children: Seq[SparkPlan])(@transient sc: SparkContext) extends SparkPlan { --- End diff -- Maybe name this operators `Subtract`. In general most of the catalyst/spark SQL operators are named after the relational operations they are performing. If you aren't using `sc` I'd drop it and the `otherCopyArgs`. Also, lets enforce the constraint from the TODO below using the typesystem. Just have the operator be a BinaryNode with two children, `left` and `right`.
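The shape marmbrus suggests — a binary operator whose two children are `left` and `right`, so the arity constraint is enforced by the type system rather than a runtime check — can be sketched with stand-in types (the `Plan` traits below are illustrative; catalyst's real operators extend `SparkPlan` and work on `RDD[Row]`, not local sequences):

```scala
// Stand-ins for SparkPlan / BinaryNode, just to show the suggested shape.
trait Plan { def execute(): Seq[Seq[Any]] }          // a "row" here is a Seq[Any]
trait BinaryPlan extends Plan { def left: Plan; def right: Plan }

case class LocalData(rows: Seq[Seq[Any]]) extends Plan {
  override def execute(): Seq[Seq[Any]] = rows
}

// EXCEPT / Subtract: rows of `left` that do not appear in `right`.
// Two explicit children instead of a Seq of children makes the binary
// constraint a compile-time property.
case class Subtract(left: Plan, right: Plan) extends BinaryPlan {
  override def execute(): Seq[Seq[Any]] = {
    val rightSet = right.execute().toSet
    left.execute().filterNot(rightSet.contains)
  }
}
```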
[GitHub] spark pull request: Spark SQL basicOperators add Except operator
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1151#issuecomment-46761858 Thanks for working on this! A few remaining tasks:
- [ ] Add a new logical operator in `basicOperators.scala` in `catalyst/...`.
- [ ] Hook that new logical operator into both parsers `HiveQl` and `SqlParser`.
- [ ] Address the review comments.
- [ ] Add a few tests in `SQLQuerySuite`
- [ ] See if there are any new hive tests that we can whitelist in `HiveCompatibilitySuite` otherwise add a test in `HiveQuerySuite`.
[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1166#issuecomment-46761937 I'm going to merge this since only MIMA is failing.
[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1166#issuecomment-46761957 Merged into master and 1.0.
[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1166
[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1147#discussion_r14051239 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala --- @@ -124,4 +128,6 @@ class JoinedRow extends Row { } new GenericRow(copiedValues) } + + override def toString() = s"[JoinedRow][left:$row1][right:$row2]" --- End diff -- I think this should probably just print out like a normal row, but with the values from both sides. How about just pulling these changes into their own PR since they are generally useful and we can merge that right away.
[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1147#discussion_r14051257 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala --- @@ -81,6 +81,10 @@ class JoinedRow extends Row { this } + def setLeftRow(r: Row) { this.row1 = r } --- End diff -- Maybe a more functional approach?
```scala
def withLeft(newLeft: Row): Row = {
  row1 = newLeft
  this
}
```
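The fluent, allocation-free style suggested above can be sketched with a stand-in row type (`JoinedPair` and its members are illustrative, not catalyst's `JoinedRow`):

```scala
// A mutable view over two rows, reused across iterations to avoid per-row
// allocation; returning `this` from the setters lets call sites chain them.
class JoinedPair(private var left: Seq[Any], private var right: Seq[Any]) {
  def withLeft(newLeft: Seq[Any]): JoinedPair = { left = newLeft; this }
  def withRight(newRight: Seq[Any]): JoinedPair = { right = newRight; this }
  def values: Seq[Any] = left ++ right
}
```

A join loop can then set the build-side row once and call `withLeft` per streamed row without constructing a new object each time.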
[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1147#discussion_r14051261 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala --- @@ -25,26 +25,6 @@ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical._ -/** - * A pattern that matches any number of filter operations on top of another relational operator. - * Adjacent filter operators are collected and their conditions are broken up and returned as a - * sequence of conjunctive predicates. - * - * @return A tuple containing a sequence of conjunctive predicates that should be used to filter the - * output and a relational operator. - */ -object FilteredOperation extends PredicateHelper { --- End diff -- Why are you deleting this?
[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1147#discussion_r14051271 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
```
@@ -114,48 +94,27 @@ object HashFilteredJoin extends Logging with PredicateHelper {
     (JoinType, Seq[Expression], Seq[Expression], Option[Expression], LogicalPlan, LogicalPlan)
   def unapply(plan: LogicalPlan): Option[ReturnType] = plan match {
-    // All predicates can be evaluated for inner join (i.e., those that are in the ON
-    // clause and WHERE clause.)
-    case FilteredOperation(predicates, join @ Join(left, right, Inner, condition)) =>
-      logger.debug(s"Considering hash inner join on: ${predicates ++ condition}")
-      splitPredicates(predicates ++ condition, join)
-    // All predicates can be evaluated for left semi join (those that are in the WHERE
-    // clause can only come from the left table, so they can all be pushed down.)
-    case FilteredOperation(predicates, join @ Join(left, right, LeftSemi, condition)) =>
-      logger.debug(s"Considering hash left semi join on: ${predicates ++ condition}")
-      splitPredicates(predicates ++ condition, join)
     case join @ Join(left, right, joinType, condition) =>
       logger.debug(s"Considering hash join on: $condition")
-      splitPredicates(condition.toSeq, join)
-    case _ => None
-  }
-
-  // Find equi-join predicates that can be evaluated before the join, and thus can be used
-  // as join keys.
-  def splitPredicates(allPredicates: Seq[Expression], join: Join): Option[ReturnType] = {
-    val Join(left, right, joinType, _) = join
-    val (joinPredicates, otherPredicates) =
-      allPredicates.flatMap(splitConjunctivePredicates).partition {
+      // Find equi-join predicates that can be evaluated before the join, and thus can be used
+      // as join keys.
+      val (joinPredicates, otherPredicates) =
+        condition.map(splitConjunctivePredicates).getOrElse(Nil).partition {
           case Equals(l, r) if (canEvaluate(l, left) && canEvaluate(r, right)) ||
             (canEvaluate(l, right) && canEvaluate(r, left)) => true
           case _ => false
         }
-    val joinKeys = joinPredicates.map {
-      case Equals(l, r) if canEvaluate(l, left) && canEvaluate(r, right) => (l, r)
-      case Equals(l, r) if canEvaluate(l, right) && canEvaluate(r, left) => (r, l)
-    }
+      val joinKeys = joinPredicates.map {
+        case Equals(l, r) if canEvaluate(l, left) && canEvaluate(r, right) => (l, r)
+        case Equals(l, r) if canEvaluate(l, right) && canEvaluate(r, left) => (r, l)
+      }
-    // Do not consider this strategy if there are no join keys.
```
--- End diff -- Why are you changing the semantics of this pattern? It is called `HashFilteredJoin` but is now matching joins that cannot be answered using hashing techniques.
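The key-extraction step in the diff above — split the join condition into conjuncts, keep the equality predicates whose two sides can each be evaluated against one input, and orient them as (left key, right key) pairs — can be sketched on a toy expression AST (every name here is illustrative, not catalyst's real classes):

```scala
// Toy AST: attributes tagged with the side ('L' or 'R') they come from.
sealed trait Expr
case class Attr(name: String, side: Char) extends Expr
case class Equals(l: Expr, r: Expr) extends Expr
case class And(l: Expr, r: Expr) extends Expr

// Flatten nested ANDs into a list of conjunctive predicates.
def splitConjuncts(e: Expr): Seq[Expr] = e match {
  case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
  case other     => Seq(other)
}

// Toy stand-in for canEvaluate: the expression refers only to one side.
def canEvaluate(e: Expr, side: Char): Boolean = e match {
  case Attr(_, s) => s == side
  case _          => false
}

// Partition into oriented equi-join key pairs and leftover predicates.
def extractJoinKeys(cond: Expr): (Seq[(Expr, Expr)], Seq[Expr]) = {
  val (equiPreds, others) = splitConjuncts(cond).partition {
    case Equals(l, r) =>
      (canEvaluate(l, 'L') && canEvaluate(r, 'R')) ||
      (canEvaluate(l, 'R') && canEvaluate(r, 'L'))
    case _ => false
  }
  val keys = equiPreds.map {
    case Equals(l, r) if canEvaluate(l, 'L') => (l, r)  // already oriented
    case Equals(l, r)                        => (r, l)  // swap to (left, right)
  }
  (keys, others)
}
```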
[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1147#discussion_r14051287 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -29,30 +29,25 @@ import org.apache.spark.sql.columnar.{InMemoryRelation, InMemoryColumnarTableSca private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] { self: SQLContext#SparkPlanner = - object LeftSemiJoin extends Strategy with PredicateHelper { + object JoinOperatorSelection extends Strategy with PredicateHelper { +// put all of the join strategy here, since the match ordering is quite critical for --- End diff -- Putting all of the join types in a single strategy means that we will never consider multiple ways to execute a given join. The whole point of strategies is that we can eventually add cost based optimizations as part of the QueryPlanner infrastructure.
[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1147#discussion_r14051296 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala --- @@ -36,158 +37,211 @@ case object BuildLeft extends BuildSide case object BuildRight extends BuildSide /** - * :: DeveloperApi :: + * Output the tuples for the matched (with the same join key) join groups, accordingly to join type */ -@DeveloperApi -case class HashJoin( -leftKeys: Seq[Expression], -rightKeys: Seq[Expression], -buildSide: BuildSide, -left: SparkPlan, -right: SparkPlan) extends BinaryNode { +trait BinaryJoinNode extends BinaryNode { + self: Product = - override def outputPartitioning: Partitioning = left.outputPartitioning + val SINGLE_NULL_LIST = Seq[Row](null) --- End diff -- Let's stick to `camelCase`.
[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1147#discussion_r14051297 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala --- @@ -18,6 +18,7 @@ package org.apache.spark.sql.execution import scala.collection.mutable.{ArrayBuffer, BitSet} +import scala.beans.BeanProperty --- End diff -- Remove.
[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1147#discussion_r14051302 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala --- @@ -249,12 +297,15 @@ case class LeftSemiJoinBNL( * :: DeveloperApi :: */ @DeveloperApi -case class CartesianProduct(left: SparkPlan, right: SparkPlan) extends BinaryNode { +case class CartesianProduct(left: SparkPlan, right: SparkPlan, --- End diff -- What was wrong with the way this was before? By duplicating the logic for doing filtering you are making this operator more complicated, and now when we do things like codegen we are going to have to make changes to condition evaluation in two places.
[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1147#issuecomment-46762746 I think it would be much better if this PR just added support for LeftOuter (and maybe RightOuter too?) to HashJoin.
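What "adding LeftOuter support to HashJoin" would compute can be sketched on plain collections (a hypothetical helper, not Spark's operator; the real HashJoin builds its hash table per partition over RDDs):

```scala
// Hash-based left outer join: build a hash table on the right side,
// stream the left side, and null-pad (None here) when no match exists.
def hashLeftOuterJoin[K, L, R](
    left: Seq[(K, L)],
    right: Seq[(K, R)]): Seq[(K, L, Option[R])] = {
  val buildSide: Map[K, Seq[R]] =
    right.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
  left.flatMap { case (k, l) =>
    buildSide.get(k) match {
      case Some(rs) => rs.map(r => (k, l, Some(r)))
      case None     => Seq((k, l, None))
    }
  }
}
```

A RightOuter variant would swap which side is streamed and which is built.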
[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...
GitHub user tmalaska opened a pull request: https://github.com/apache/spark/pull/1168 SPARK-1478.2: Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915 SPARK-1478.2: Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915 You can merge this pull request into a Git repository by running: $ git pull https://github.com/tmalaska/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1168.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1168 commit 12617e51c6f9fbbcf1b21db2cdcda2f7594b10d1 Author: tmalaska ted.mala...@cloudera.com Date: 2014-06-21T20:03:58Z SPARK-1478: Upgrade FlumeInputDStream's Flume... SPARK-1478: Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...
Github user tmalaska commented on the pull request: https://github.com/apache/spark/pull/566#issuecomment-46763419 New Pull request https://github.com/apache/spark/pull/1168
[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/1168#issuecomment-46763507 Jenkins, test this please.
[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1168#issuecomment-46763617 Merged build triggered.
[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1168#issuecomment-46763622 Merged build started.
[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1168#issuecomment-46763649 Merged build finished.
[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1168#issuecomment-46763650 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15985/
[GitHub] spark pull request: spark-ec2: quote command line args
GitHub user orikremer opened a pull request: https://github.com/apache/spark/pull/1169 spark-ec2: quote command line args To preserve quoted command line args (in case options have spaces in them). You can merge this pull request into a Git repository by running: $ git pull https://github.com/orikremer/spark quote_cmd_line_args Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1169.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1169 commit 67e2aa1c7f945ff43a5b2b092f5cb25904f92265 Author: Ori Kremer ori.kre...@gmail.com Date: 2014-06-21T20:28:34Z quote command line args
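The problem this patch addresses can be demonstrated with Python's standard `shlex` module (a sketch of the general quoting issue, not spark-ec2's actual code): naively joining arguments with spaces loses the boundaries of any argument that itself contains a space, while quoting each argument first lets a later re-parse recover them exactly.

```python
import shlex

args = ["--name", "my cluster", "--zone", "us-east-1a"]

# Naive join: the space inside "my cluster" splits it into two arguments
# when the string is parsed again by a shell.
naive = " ".join(args)
assert shlex.split(naive) == ["--name", "my", "cluster", "--zone", "us-east-1a"]

# Quote each argument before joining so the boundaries survive a re-parse.
quoted = " ".join(shlex.quote(a) for a in args)
assert shlex.split(quoted) == args
```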
[GitHub] spark pull request: spark-ec2: quote command line args
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1169#issuecomment-46763973 Can one of the admins verify this patch?
[GitHub] spark pull request: [WIP] [SQL] SPARK-1800 Add broadcast hash join...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1163#discussion_r14051518

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -243,16 +242,25 @@ object HiveMetastoreTypes extends RegexParsers {
   }
 }
+
--- End diff --

Extra spaces.
[GitHub] spark pull request: [SQL] SPARK-1800 Add broadcast hash join opera...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/734#issuecomment-46764177 Closing in favor of: #1163
[GitHub] spark pull request: [SQL] SPARK-1800 Add broadcast hash join opera...
Github user marmbrus closed the pull request at: https://github.com/apache/spark/pull/734
[GitHub] spark pull request: [WIP] [SQL] SPARK-1800 Add broadcast hash join...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1163#discussion_r14051536

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala ---
@@ -45,8 +45,8 @@ class Projection(expressions: Seq[Expression]) extends (Row => Row) {
  * that schema.
  *
  * In contrast to a normal projection, a MutableProjection reuses the same underlying row object
- * each time an input row is added. This significatly reduces the cost of calcuating the
- * projection, but means that it is not safe
+ * each time an input row is added. This significantly reduces the cost of calculating the
+ * projection, but means that it is not safe ...?
--- End diff --

... to hold on to a reference to a `Row` after `next()` has been called on the `Iterator` that produced it. Instead, the user must call `Row.copy()` and hold on to the returned `Row` before calling `next()`.
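The hazard being documented can be mimicked in a short sketch (hypothetical Python, not Spark's `MutableProjection`; `mutable_rows` is a made-up stand-in): an iterator that reuses one underlying buffer for every row is cheap, but any reference held across `next()` silently changes, so the caller must copy before advancing.

```python
def mutable_rows(data):
    """Yield every row through one reused buffer, like a mutable projection."""
    buffer = [None, None]
    for a, b in data:
        buffer[0], buffer[1] = a, b
        yield buffer  # the same list object every time

data = [(1, 2), (3, 4)]

# Unsafe: holding references across next() leaves every entry pointing
# at the buffer's final state.
unsafe = list(mutable_rows(data))
assert unsafe == [[3, 4], [3, 4]]

# Safe: copy each row (the analogue of Row.copy()) before advancing.
safe = [list(row) for row in mutable_rows(data)]
assert safe == [[1, 2], [3, 4]]
```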
[GitHub] spark pull request: spark-ec2: quote command line args
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1169#issuecomment-46764293 Jenkins, test this please. Thanks, this LGTM.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46764347 Merged build started.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46764340 Merged build triggered.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46764649 Merged build triggered.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46764653 Merged build started.
[GitHub] spark pull request: [WIP] [SQL] SPARK-1800 Add broadcast hash join...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1163#issuecomment-46764715 Regarding testing, we will probably want to pull all of our various join tests out into a separate test suite that can be run with various options turned on and off, so we exercise all of the edge cases for each of the join operators. This is going to become more important as we add more and more join types, so I think it's worth putting some time into it. Toward that, we might consider breaking this PR into a few pieces: get the new join type / testing in soon, and add the auto selection / cost estimation in a follow-up.
[GitHub] spark pull request: spark-ec2: quote command line args
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1169#issuecomment-46765273 Merged build finished. All automated tests passed.
[GitHub] spark pull request: spark-ec2: quote command line args
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1169#issuecomment-46765274 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15986/
[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/906#issuecomment-46765270 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46765275 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15987/
[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/906#issuecomment-46765330 Merged build started.
[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/906#issuecomment-46765328 Merged build triggered.
[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/906#issuecomment-46765353 Merged build finished.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46765537 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15988/
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46765536 Merged build finished.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46765859 Merged build triggered.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46765864 Merged build started.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46765886 Merged build finished.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46765887 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15990/
[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/1170 SPARK-1996. Remove use of special Maven repo for Akka Just following up Matei's suggestion to remove the Akka repo references. Builds and the audit-release script appear OK. You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-1996 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1170.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1170 commit 5ca2930ccb7485a3037fa9bac3a5a4b996385167 Author: Sean Owen so...@cloudera.com Date: 2014-06-21T22:05:56Z Remove outdated Akka repository references
[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1170#issuecomment-46766169 Merged build triggered.
[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1170#issuecomment-46766174 Merged build started.
[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1170#issuecomment-46766197 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15991/
[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/1171 SPARK-1675. Make clear whether computePrincipalComponents requires centered data Just closing out this small JIRA, resolving with a comment change. You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-1675 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1171.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1171 commit 45ee9b7cccf8ecb25647df5d2deb819caddab26a Author: Sean Owen so...@cloudera.com Date: 2014-06-21T22:10:47Z Add simple note that data need not be centered for computePrincipalComponents
[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1170#issuecomment-46766196 Merged build finished.
[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1171#issuecomment-46766279 Merged build triggered.
[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1171#issuecomment-46766302 Merged build finished.
[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1171#issuecomment-46766283 Merged build started.
[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1171#issuecomment-46766303 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15992/
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46766392 Merged build finished.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46766386 Merged build started.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46766437 Jenkins, retest this please.
[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1170#issuecomment-46766456 Jenkins, retest this please.
[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/906#issuecomment-46766447 Jenkins, retest this please.
[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1171#issuecomment-46766462 Jenkins, retest this please.
[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1170#issuecomment-46766505 Merged build triggered.
[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1170#issuecomment-46766513 Merged build started.
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46766508 Merged build triggered.
[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/906#issuecomment-46766509 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1171#issuecomment-46766514 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1171#issuecomment-46766504 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/906#issuecomment-46766516 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1165#issuecomment-46766515 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2124] Move aggregation into shuffle imp...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1064#discussion_r14051999

--- Diff: core/src/main/scala/org/apache/spark/shuffle/hash/HashShuffleReader.scala ---

```diff
@@ -31,10 +31,24 @@ class HashShuffleReader[K, C](
   require(endPartition == startPartition + 1,
     "Hash shuffle currently only supports fetching one partition")

+  private val dep = handle.dependency
+
   /** Read the combined key-values for this reduce task */
   override def read(): Iterator[Product2[K, C]] = {
-    BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context,
-      Serializer.getSerializer(handle.dependency.serializer))
+    val iter = BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context,
+      Serializer.getSerializer(dep.serializer))
+
+    if (dep.aggregator.isDefined) {
+      if (dep.mapSideCombine) {
+        dep.aggregator.get.combineCombinersByKey(iter, context)
+      } else {
+        dep.aggregator.get.combineValuesByKey(iter, context)
```

--- End diff --

So the one problem I see is that the InterruptibleIterator around these calls was lost when you moved them here. This is not great because it means tasks running these won't be cancelable. Can you add it back? You already have a TaskContext as a field of ShuffleReader.
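The cancellation mechanism mateiz refers to can be sketched in miniature. This is a hedged stand-in, not the real Spark classes: `FakeTaskContext` and this `InterruptibleIterator` only loosely mirror Spark's `TaskContext` and `InterruptibleIterator`. The point it illustrates is why losing the wrapper matters: the wrapper checks the interruption flag on every `hasNext`, so a long-running reduce task consuming the aggregated iterator can be cancelled mid-stream.

```scala
// Hypothetical stand-in for Spark's TaskContext: just an interruption flag.
class FakeTaskContext {
  @volatile var interrupted: Boolean = false
}

// Wraps a delegate iterator and checks for interruption on each hasNext call.
// Without this wrapper, a task iterating over shuffle output never observes
// that it has been cancelled.
class InterruptibleIterator[T](val context: FakeTaskContext, val delegate: Iterator[T])
  extends Iterator[T] {
  override def hasNext: Boolean = {
    if (context.interrupted) {
      throw new InterruptedException("task interrupted")
    }
    delegate.hasNext
  }
  override def next(): T = delegate.next()
}

object InterruptibleIteratorDemo {
  def main(args: Array[String]): Unit = {
    val ctx = new FakeTaskContext
    // As suggested in the review: wrap the aggregated iterator before
    // returning it from read(), using the context the reader already holds.
    val iter = new InterruptibleIterator(ctx, Iterator(1, 2, 3))
    assert(iter.hasNext && iter.next() == 1) // consumes normally at first
    ctx.interrupted = true                   // simulate task cancellation
    val cancelled =
      try { iter.hasNext; false }
      catch { case _: InterruptedException => true }
    assert(cancelled)
    println("cancellation propagated: " + cancelled)
  }
}
```

The wrapper costs one volatile read per element check, which is why it can safely sit on the hot path of every shuffle read.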
[GitHub] spark pull request: [SPARK-2124] Move aggregation into shuffle imp...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1064#discussion_r14052024

--- Diff: core/src/test/scala/org/apache/spark/ShuffleSuite.scala ---

```diff
@@ -78,8 +81,11 @@ class ShuffleSuite extends FunSuite with Matchers with LocalSparkContext {
   }
   // If the Kryo serializer is not used correctly, the shuffle would fail because the
   // default Java serializer cannot handle the non serializable class.
-  val c = new ShuffledRDD[Int, NonJavaSerializableClass, (Int, NonJavaSerializableClass)](
-    b, new HashPartitioner(3)).setSerializer(new KryoSerializer(conf))
+  val c = new ShuffledRDD[Int,
+    NonJavaSerializableClass,
+    NonJavaSerializableClass,
+    (Int, NonJavaSerializableClass)](b, new HashPartitioner(3))
+    .setSerializer(new KryoSerializer(conf))
```

--- End diff --

Probably should split out the call to setSerializer into a new statement instead of chaining it. (Just do `c.setSerializer(...)`.)
[GitHub] spark pull request: [SPARK-2124] Move aggregation into shuffle imp...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1064#discussion_r14052020

--- Diff: core/src/test/scala/org/apache/spark/ShuffleSuite.scala ---

```diff
@@ -56,8 +56,11 @@ class ShuffleSuite extends FunSuite with Matchers with LocalSparkContext {
   }
   // If the Kryo serializer is not used correctly, the shuffle would fail because the
   // default Java serializer cannot handle the non serializable class.
-  val c = new ShuffledRDD[Int, NonJavaSerializableClass, (Int, NonJavaSerializableClass)](
-    b, new HashPartitioner(NUM_BLOCKS)).setSerializer(new KryoSerializer(conf))
+  val c = new ShuffledRDD[Int,
+    NonJavaSerializableClass,
+    NonJavaSerializableClass,
+    (Int, NonJavaSerializableClass)](b, new HashPartitioner(NUM_BLOCKS))
```

--- End diff --

Probably should split out the call to setSerializer into a new statement instead of chaining it. (Just do `c.setSerializer(...)`.)
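The restructuring mateiz suggests can be shown with a minimal sketch. The classes below are hypothetical stubs standing in for the Spark types in the diff, just enough to show the setter-as-a-statement style; only `setSerializer` returning `this.type` is modeled after the pattern the diff relies on for chaining.

```scala
// Hypothetical stubs; not the real Spark API.
class SparkConf
class KryoSerializer(conf: SparkConf)
class HashPartitioner(val partitions: Int)

class ShuffledRDD[K, V, C, P](prev: AnyRef, part: HashPartitioner) {
  var serializer: Option[KryoSerializer] = None
  // Returns this.type, so it can be chained -- but chaining it onto a
  // multi-line constructor call is what the review asks to avoid.
  def setSerializer(s: KryoSerializer): this.type = { serializer = Some(s); this }
}

object SetSerializerDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf
    val b = new AnyRef
    // Bind the RDD first, then call the setter as its own statement,
    // instead of appending .setSerializer(...) to the constructor expression.
    val c = new ShuffledRDD[Int, String, String, (Int, String)](b, new HashPartitioner(3))
    c.setSerializer(new KryoSerializer(conf))
    assert(c.serializer.isDefined)
  }
}
```

Splitting the call out keeps the already-long four-type-parameter constructor expression readable and makes the setter easy to spot when scanning the test.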
[GitHub] spark pull request: [SPARK-2124] Move aggregation into shuffle imp...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1064#issuecomment-46767244 Hey Saisai, I noticed one thing that got lost in the move, which is the use of InterruptibleIterator. We need to bring that back to allow cancellation of reduce tasks. Other than that it looks good to me.
[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...
Github user tmalaska commented on the pull request: https://github.com/apache/spark/pull/1168#issuecomment-46767307 Thanks tdas, I missed that one. I just updated. It should be good now.
[GitHub] spark pull request: [SPARK-1112, 2156] (1.0 edition) Use correct a...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1172#issuecomment-46767349 @mengxr - do you mind reviewing this?