[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...

2014-06-21 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1166

[SQL] Break hiveOperators.scala into multiple files.

The single file was getting very long (500+ loc). 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark hiveOperators

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1166.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1166


commit 5b430689aff95482c97b88b860d0734de459038c
Author: Reynold Xin r...@apache.org
Date:   2014-06-21T06:26:18Z

[SQL] Break hiveOperators.scala into multiple files.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1166#issuecomment-46746072
  
 Merged build triggered. 




[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1166#issuecomment-46746074
  
Merged build started. 




[GitHub] spark pull request: [SPARK-2227] Support dfs command in SQL.

2014-06-21 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1167

[SPARK-2227] Support dfs command in SQL.

Note that nothing gets printed to the console because we don't properly 
maintain session state right now.

I will have a followup PR that fixes it.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark commands

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1167.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1167


commit 56f04f8f0ab5c6949f2a4bf776b449dea5b368cf
Author: Reynold Xin r...@apache.org
Date:   2014-06-21T06:27:58Z

[SPARK-2227] Support dfs command in SQL.






[GitHub] spark pull request: [SPARK-2227] Support dfs command in SQL.

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1167#issuecomment-46746547
  
 Merged build triggered. 




[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1166#issuecomment-46747248
  
Merged build finished. 




[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1166#issuecomment-46747249
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15982/




[GitHub] spark pull request: [SPARK-2227] Support dfs command in SQL.

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1167#issuecomment-46747769
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-2227] Support dfs command in SQL.

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1167#issuecomment-46747770
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15983/




[GitHub] spark pull request: SparkSQL add SkewJoin

2014-06-21 Thread YanjieGao
Github user YanjieGao commented on the pull request:

https://github.com/apache/spark/pull/1134#issuecomment-46754360
  
Hi rxin, I have reformatted it. Can you give me some suggestions? I will try to 
make it better.




[GitHub] spark pull request: Spark SQL basicOperators add Except operator

2014-06-21 Thread YanjieGao
Github user YanjieGao commented on the pull request:

https://github.com/apache/spark/pull/1151#issuecomment-46754425
  
Hi Zongheng, I tried it and tried adding code like the other operators. To add 
this Except operator, do I need to add or modify code in other Scala files? 
Thanks a lot.




[GitHub] spark pull request: Spark SQL add LeftSemiBloomFilterBroadcastJoin

2014-06-21 Thread YanjieGao
Github user YanjieGao commented on the pull request:

https://github.com/apache/spark/pull/1127#issuecomment-46754487
  
Hi Zongheng, I have reformatted the code. I don't know if that is OK, and I hope 
you can give me more suggestions. Thanks a lot.




[GitHub] spark pull request: Branch 1.0 Add ZLIBCompressionCodec code

2014-06-21 Thread YanjieGao
Github user YanjieGao commented on the pull request:

https://github.com/apache/spark/pull/1115#issuecomment-46754558
  
Hi srowen, markhamstra,
I want to merge this into the master branch. Last time I made a mistake.
I resubmitted this patch in https://github.com/apache/spark/pull/1121
I don't know if this is right. Can you give me some other suggestions?




[GitHub] spark pull request: Branch 1.0 Add ZLIBCompressionCodec code

2014-06-21 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1115#issuecomment-46755115
  
Assuming you have the `apache/spark` repository configured in your git 
repository as `upstream`, you can checkout your branch for this PR and `git 
pull --rebase upstream master`. This will try to apply your commits onto the 
latest code. You may have to resolve merge conflicts. Then `git push` your 
branch to update this PR.

If it's getting confusing, you can start over. Update your copy of 
`master`, make a new branch, and `apply` or `cherry-pick` your commits.




[GitHub] spark pull request: Branch 1.0 Add ZLIBCompressionCodec code

2014-06-21 Thread YanjieGao
Github user YanjieGao commented on the pull request:

https://github.com/apache/spark/pull/1115#issuecomment-46755267
  
Thanks a lot, I will do as you said. I originally submitted it by forking the 
Spark repository on the web. I wrote the code and ran it in IntelliJ, then 
edited the Scala file on the web page to add the new code and committed it 
there. Is updating the code this way right or not? Thanks!




[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...

2014-06-21 Thread tmalaska
Github user tmalaska commented on the pull request:

https://github.com/apache/spark/pull/566#issuecomment-46755792
  
I'm going to have to make a new pull request, because I dropped the repo 
that belonged to this pull request. I will update the ticket with the 
information when it's ready.




[GitHub] spark pull request: Feat kryo max buffersize

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/735#issuecomment-46759314
  
 Merged build triggered. 




[GitHub] spark pull request: Feat kryo max buffersize

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/735#issuecomment-46759318
  
Merged build started. 




[GitHub] spark pull request: Feat kryo max buffersize

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/735#issuecomment-46760188
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15984/




[GitHub] spark pull request: Feat kryo max buffersize

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/735#issuecomment-46760187
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-2227] Support dfs command in SQL.

2014-06-21 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1167#issuecomment-46761293
  
We do have a circular buffer that holds the command output already.  We 
could probably just add a command to clear it before each command and then 
optionally use it as the query result for these types of commands.
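
A minimal sketch of that idea in plain Scala, not the buffer class Spark SQL actually uses: a fixed-capacity character buffer that is cleared before each command and then read back as the command's result. The names `OutputBuffer` and `runCommand` are hypothetical.

```scala
// Hypothetical sketch; Spark SQL's real output buffer is a different class.
object CommandOutputSketch {
  /** Keeps only the most recent `capacity` characters written to it. */
  final class OutputBuffer(capacity: Int) {
    private val buf = new Array[Char](capacity)
    private var pos = 0          // next write position
    private var filled = false   // true once the buffer has wrapped around

    def write(s: String): Unit = s.foreach { c =>
      buf(pos) = c
      pos = (pos + 1) % capacity
      if (pos == 0) filled = true
    }

    def clear(): Unit = { pos = 0; filled = false }

    // Oldest characters first, newest last.
    override def toString: String =
      if (filled) new String(buf, pos, capacity - pos) + new String(buf, 0, pos)
      else new String(buf, 0, pos)
  }

  /** Clears the buffer, runs the command, and returns what it printed, one line per row. */
  def runCommand(out: OutputBuffer)(command: OutputBuffer => Unit): Seq[String] = {
    out.clear()
    command(out)
    out.toString.split("\n").toSeq.filter(_.nonEmpty)
  }
}
```

Clearing first means output left over from an earlier command cannot leak into the next result.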




[GitHub] spark pull request: [SPARK-2227] Support dfs command in SQL.

2014-06-21 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1167#issuecomment-46761327
  
This is currently failing for unrelated GraphX MIMA issues.

```
[info] spark-graphx: found 0 potential binary incompatibilities (filtered 17)
[error]  * method partitions()java.util.List in trait org.apache.spark.api.java.JavaRDDLike does not have a correspondent in old version
[error]    filter with: ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.api.java.JavaRDDLike.partitions")
```

test this please




[GitHub] spark pull request: Spark SQL add LeftSemiBloomFilterBroadcastJoin

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1127#discussion_r14051069
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala ---
@@ -245,6 +245,73 @@ case class LeftSemiJoinBNL(
   }
 }
 
+
+
+
+
+/**
+ * :: DeveloperApi ::
+ * LeftSemiBloomFilterBroadcastJoin
+ * Sometimes the semijoin's broadcast table can't fit memory.So  we can make it as Bloomfilter to  reduce the space
+ * and then broadcast it do the mapside  join
+ * The bloomfilter  use Shark's BloomFilter class implementation.
+ */
+@DeveloperApi
+case class LeftSemiJoinBFB(
+leftKeys: Seq[Expression],
--- End diff --

Indent 4 spaces. Also, I'd go with the full, more descriptive name instead 
of BFB, since we will only have to type it out in about 2 places.
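
For readers unfamiliar with the technique in the doc comment under review, here is a rough sketch of a Bloom-filter-based left semi join over plain Scala collections. It is not the PR's code and does not use Shark's `BloomFilter` class; `BloomSemiJoinSketch`, its tiny filter, and the sizing constants are all hypothetical.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Hypothetical sketch; the operator in the PR and Shark's BloomFilter differ.
object BloomSemiJoinSketch {
  /** A tiny Bloom filter: never a false negative, occasionally a false positive. */
  final class BloomFilter(numBits: Int, numHashes: Int) {
    private val bits = new mutable.BitSet(numBits)

    // One bit position per hash function, derived by varying the seed.
    private def positions(key: Any): Seq[Int] =
      (0 until numHashes).map { seed =>
        Math.floorMod(MurmurHash3.stringHash(String.valueOf(key), seed), numBits)
      }

    def add(key: Any): Unit = positions(key).foreach(p => bits += p)
    def mightContain(key: Any): Boolean = positions(key).forall(bits.contains)
  }

  /** Keeps the left rows whose key may appear among `rightKeys`. */
  def leftSemiJoin[K, L](left: Seq[L], rightKeys: Seq[K])(keyOf: L => K): Seq[L] = {
    val filter = new BloomFilter(numBits = 1 << 16, numHashes = 3)
    rightKeys.foreach(k => filter.add(k))   // built once; this is what would be broadcast
    left.filter(row => filter.mightContain(keyOf(row)))
  }
}
```

Because of false positives, a filter like this can let through left rows that have no match, so an exact semi join would still need a verification step after it.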




[GitHub] spark pull request: Spark SQL basicOperators add Except operator

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1151#discussion_r14051137
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala ---
@@ -204,3 +204,18 @@ case class ExistingRdd(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode {
   override def execute() = rdd
 }
 
+/**
+ * :: DeveloperApi ::
+ * This operator support the substract function .Return an table with the elements from `this` that are not in `other`.
--- End diff --

Limit lines to 100 chars.




[GitHub] spark pull request: Spark SQL basicOperators add Except operator

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1151#discussion_r14051147
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala ---
@@ -204,3 +204,18 @@ case class ExistingRdd(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode {
   override def execute() = rdd
 }
 
+/**
+ * :: DeveloperApi ::
+ * This operator support the substract function .Return an table with the elements from `this` that are not in `other`.
+ */
+@DeveloperApi
+case class Except(children: Seq[SparkPlan])(@transient sc: SparkContext) extends SparkPlan {
--- End diff --

Maybe name this operator `Subtract`. In general, most of the 
Catalyst/Spark SQL operators are named after the relational operations they 
perform.

If you aren't using `sc`, I'd drop it and the `otherCopyArgs`.

Also, let's enforce the constraint from the TODO below using the type system. 
Just have the operator be a `BinaryNode` with two children, `left` and `right`.
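
A toy illustration of that last point, using simplified stand-ins rather than Spark's real `SparkPlan` and `BinaryNode`: once the operator takes `left` and `right` instead of a `Seq` of children, "exactly two inputs" no longer needs a runtime check. Every type in this sketch is hypothetical.

```scala
// Hypothetical toy model; Row, Plan and BinaryNode are simplified stand-ins.
object SubtractSketch {
  type Row = Seq[Any]

  trait Plan {
    def children: Seq[Plan]
    def execute(): Seq[Row]
  }

  /** Having exactly two children is guaranteed by the type. */
  trait BinaryNode extends Plan {
    def left: Plan
    def right: Plan
    final def children: Seq[Plan] = Seq(left, right)
  }

  case class Scan(rows: Seq[Row]) extends Plan {
    def children: Seq[Plan] = Nil
    def execute(): Seq[Row] = rows
  }

  /** Rows of `left` that do not appear in `right`. */
  case class Subtract(left: Plan, right: Plan) extends BinaryNode {
    def execute(): Seq[Row] = {
      val exclude = right.execute().toSet
      left.execute().filterNot(exclude)
    }
  }
}
```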




[GitHub] spark pull request: Spark SQL basicOperators add Except operator

2014-06-21 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1151#issuecomment-46761858
  
Thanks for working on this!

A few remaining tasks:
 - [ ] Add a new logical operator in `basicOperators.scala` in 
`catalyst/...`.
 - [ ] Hook that new logical operator into both parsers `HiveQl` and 
`SqlParser`.
 - [ ] Address the review comments.
 - [ ] Add a few tests in `SQLQuerySuite`
 - [ ] See if there are any new hive tests that we can whitelist in 
`HiveCompatibilitySuite` otherwise add a test in `HiveQuerySuite`.




[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...

2014-06-21 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1166#issuecomment-46761937
  
I'm going to merge this since only MIMA is failing.




[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...

2014-06-21 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1166#issuecomment-46761957
  
Merged into master and 1.0.




[GitHub] spark pull request: [SQL] Break hiveOperators.scala into multiple ...

2014-06-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1166




[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1147#discussion_r14051239
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala
 ---
@@ -124,4 +128,6 @@ class JoinedRow extends Row {
 }
 new GenericRow(copiedValues)
   }
+
+  override def toString() = s"[JoinedRow][left:$row1][right:$row2]"
--- End diff --

I think this should probably just print out like a normal row, but with the 
values from both sides.

How about pulling these changes into their own PR? They are 
generally useful, and we can merge that right away.
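
One way to read the first suggestion, sketched with hypothetical stand-in classes rather than catalyst's `Row` and `JoinedRow`, and assuming a bracketed, comma-separated row format: the joined row prints exactly like an ordinary row holding the values of both sides.

```scala
// Hypothetical stand-ins for catalyst's Row and JoinedRow.
object JoinedRowSketch {
  class Row(val values: Seq[Any]) {
    override def toString: String = values.mkString("[", ",", "]")
  }

  /** Presents two rows as one and inherits the ordinary row formatting. */
  class JoinedRow(left: Row, right: Row) extends Row(left.values ++ right.values)
}
```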




[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1147#discussion_r14051257
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala
 ---
@@ -81,6 +81,10 @@ class JoinedRow extends Row {
 this
   }
 
+  def setLeftRow(r: Row) { this.row1 = r }
--- End diff --

Maybe a more functional approach?
```scala
def withLeft(newLeft: Row): Row = {
  row1 = newLeft
  this
}
```




[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1147#discussion_r14051261
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -25,26 +25,6 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 
-/**
- * A pattern that matches any number of filter operations on top of another relational operator.
- * Adjacent filter operators are collected and their conditions are broken up and returned as a
- * sequence of conjunctive predicates.
- *
- * @return A tuple containing a sequence of conjunctive predicates that should be used to filter the
- * output and a relational operator.
- */
-object FilteredOperation extends PredicateHelper {
--- End diff --

Why are you deleting this?




[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1147#discussion_r14051271
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -114,48 +94,27 @@ object HashFilteredJoin extends Logging with PredicateHelper {
 (JoinType, Seq[Expression], Seq[Expression], Option[Expression], LogicalPlan, LogicalPlan)
 
   def unapply(plan: LogicalPlan): Option[ReturnType] = plan match {
-// All predicates can be evaluated for inner join (i.e., those that are in the ON
-// clause and WHERE clause.)
-case FilteredOperation(predicates, join @ Join(left, right, Inner, condition)) =>
-  logger.debug(s"Considering hash inner join on: ${predicates ++ condition}")
-  splitPredicates(predicates ++ condition, join)
-// All predicates can be evaluated for left semi join (those that are in the WHERE
-// clause can only from left table, so they can all be pushed down.)
-case FilteredOperation(predicates, join @ Join(left, right, LeftSemi, condition)) =>
-  logger.debug(s"Considering hash left semi join on: ${predicates ++ condition}")
-  splitPredicates(predicates ++ condition, join)
 case join @ Join(left, right, joinType, condition) =>
   logger.debug(s"Considering hash join on: $condition")
-  splitPredicates(condition.toSeq, join)
-case _ => None
-  }
-
-  // Find equi-join predicates that can be evaluated before the join, and thus can be used
-  // as join keys.
-  def splitPredicates(allPredicates: Seq[Expression], join: Join): Option[ReturnType] = {
-val Join(left, right, joinType, _) = join
-val (joinPredicates, otherPredicates) =
-  allPredicates.flatMap(splitConjunctivePredicates).partition {
+  // Find equi-join predicates that can be evaluated before the join, and thus can be used
+  // as join keys.
+  val (joinPredicates, otherPredicates) = condition.map(splitConjunctivePredicates).
+getOrElse(Nil).partition {
 case Equals(l, r) if (canEvaluate(l, left) && canEvaluate(r, right)) ||
   (canEvaluate(l, right) && canEvaluate(r, left)) => true
 case _ => false
   }
 
-val joinKeys = joinPredicates.map {
-  case Equals(l, r) if canEvaluate(l, left) && canEvaluate(r, right) => (l, r)
-  case Equals(l, r) if canEvaluate(l, right) && canEvaluate(r, left) => (r, l)
-}
+  val joinKeys = joinPredicates.map {
+case Equals(l, r) if canEvaluate(l, left) && canEvaluate(r, right) => (l, r)
+case Equals(l, r) if canEvaluate(l, right) && canEvaluate(r, left) => (r, l)
+  }
 
-// Do not consider this strategy if there are no join keys.
--- End diff --

Why are you changing the semantics of this pattern?  It is called 
`HashFilteredJoin` but is now matching joins that cannot be answered using 
hashing techniques.
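
To make the concern concrete, here is a small sketch of the semantics the pattern name implies: split the predicates into equi-join keys and everything else, and decline to match when no key is found. The expression model below is a hypothetical miniature, not catalyst's `Expression` hierarchy.

```scala
// Hypothetical mini expression model; catalyst's classes are richer.
object HashJoinPatternSketch {
  sealed trait Expr { def references: Set[String] }
  case class Attr(name: String) extends Expr {
    def references: Set[String] = Set(name)
  }
  case class Equals(l: Expr, r: Expr) extends Expr {
    def references: Set[String] = l.references ++ r.references
  }
  case class GreaterThan(l: Expr, r: Expr) extends Expr {
    def references: Set[String] = l.references ++ r.references
  }

  // An expression can be evaluated on one side if it only uses that side's columns.
  private def canEvaluate(e: Expr, side: Set[String]): Boolean =
    e.references.subsetOf(side)

  /** Returns (leftKeys, rightKeys, otherPredicates), or None when there are no equi-join keys. */
  def hashJoinKeys(predicates: Seq[Expr], left: Set[String], right: Set[String])
      : Option[(Seq[Expr], Seq[Expr], Seq[Expr])] = {
    val classified: Seq[Either[(Expr, Expr), Expr]] = predicates.map {
      case Equals(l, r) if canEvaluate(l, left) && canEvaluate(r, right) => Left((l, r))
      case Equals(l, r) if canEvaluate(l, right) && canEvaluate(r, left) => Left((r, l))
      case other => Right(other)
    }
    val keys = classified.collect { case Left(k) => k }
    val others = classified.collect { case Right(o) => o }
    if (keys.isEmpty) None   // not answerable by hashing; leave it to another pattern
    else {
      val (leftKeys, rightKeys) = keys.unzip
      Some((leftKeys, rightKeys, others))
    }
  }
}
```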




[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1147#discussion_r14051287
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -29,30 +29,25 @@ import org.apache.spark.sql.columnar.{InMemoryRelation, InMemoryColumnarTableSca
 private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
   self: SQLContext#SparkPlanner =>
 
-  object LeftSemiJoin extends Strategy with PredicateHelper {
+  object JoinOperatorSelection extends Strategy with PredicateHelper {
+// put all of the join strategy here, since the match ordering is quite critical for
--- End diff --

Putting all of the join types in a single strategy means that we will never 
consider multiple ways to execute a given join.  The whole point of strategies 
is that we can eventually add cost based optimizations as part of the 
QueryPlanner infrastructure.
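
A rough sketch of why separate strategies matter, assuming a made-up cost model and hypothetical types (this is not Spark's `QueryPlanner` API): each strategy proposes candidates independently, so a planner can compare them instead of taking the first match.

```scala
// Hypothetical sketch of a planner; the cost numbers are invented for illustration.
object StrategySketch {
  case class Join(leftRows: Long, rightRows: Long, hasEquiKeys: Boolean)
  case class PhysicalPlan(name: String, cost: Double)

  /** A strategy proposes zero or more physical plans for a logical join. */
  trait Strategy { def apply(join: Join): Seq[PhysicalPlan] }

  object HashJoinStrategy extends Strategy {
    def apply(j: Join): Seq[PhysicalPlan] =
      if (j.hasEquiKeys) Seq(PhysicalPlan("HashJoin", j.leftRows.toDouble + j.rightRows))
      else Nil
  }

  object NestedLoopStrategy extends Strategy {
    def apply(j: Join): Seq[PhysicalPlan] =
      Seq(PhysicalPlan("NestedLoopJoin", j.leftRows.toDouble * j.rightRows))
  }

  val strategies: Seq[Strategy] = Seq(HashJoinStrategy, NestedLoopStrategy)

  /** Collects candidates from every strategy, then picks the cheapest. */
  def plan(join: Join): PhysicalPlan =
    strategies.flatMap(_.apply(join)).minBy(_.cost)
}
```

With every join type folded into one strategy, only a single candidate would ever reach the `minBy` step.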




[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1147#discussion_r14051296
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala ---
@@ -36,158 +37,211 @@ case object BuildLeft extends BuildSide
 case object BuildRight extends BuildSide
 
 /**
- * :: DeveloperApi ::
+ * Output the tuples for the matched (with the same join key) join groups, accordingly to join type
  */
-@DeveloperApi
-case class HashJoin(
-leftKeys: Seq[Expression],
-rightKeys: Seq[Expression],
-buildSide: BuildSide,
-left: SparkPlan,
-right: SparkPlan) extends BinaryNode {
+trait BinaryJoinNode extends BinaryNode {
+  self: Product =>
 
-  override def outputPartitioning: Partitioning = left.outputPartitioning
+  val SINGLE_NULL_LIST = Seq[Row](null)
--- End diff --

Let's stick to `camelCase`.




[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1147#discussion_r14051297
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala ---
@@ -18,6 +18,7 @@
 package org.apache.spark.sql.execution
 
 import scala.collection.mutable.{ArrayBuffer, BitSet}
+import scala.beans.BeanProperty
--- End diff --

Remove.




[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1147#discussion_r14051302
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala ---
@@ -249,12 +297,15 @@ case class LeftSemiJoinBNL(
  * :: DeveloperApi ::
  */
 @DeveloperApi
-case class CartesianProduct(left: SparkPlan, right: SparkPlan) extends 
BinaryNode {
+case class CartesianProduct(left: SparkPlan, right: SparkPlan, 
--- End diff --

What was wrong with the way this was before?  By duplicating the filtering 
logic you are making this operator more complicated, and when we do things 
like codegen we will have to change condition evaluation in two places.
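
One way to read the concern above, sketched with plain Scala collections rather than Spark's actual `SparkPlan` operators (every name below is hypothetical): keep the product and the condition as separate steps, so that condition evaluation lives in exactly one place.

```scala
object ComposedOperators {
  // Hypothetical stand-in for a row; this is not Spark's Row class.
  type Row = Seq[Any]

  // Product only: pairs every left row with every right row, no condition.
  def cartesianProduct(left: Seq[Row], right: Seq[Row]): Seq[Row] =
    for (l <- left; r <- right) yield l ++ r

  // The single place where a condition is evaluated.
  def filter(condition: Row => Boolean, child: Seq[Row]): Seq[Row] =
    child.filter(condition)

  def main(args: Array[String]): Unit = {
    val left: Seq[Row] = Seq(Seq(1), Seq(2))
    val right: Seq[Row] = Seq(Seq(1), Seq(3))
    // A conditional product is just filter composed over the plain product.
    val joined = filter(row => row(0) == row(1), cartesianProduct(left, right))
    println(joined)
  }
}
```

With this shape, a later change to how conditions are evaluated would only need to touch `filter`.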




[GitHub] spark pull request: [SQL][SPARK-2212]HashJoin(Shuffled)

2014-06-21 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1147#issuecomment-46762746
  
I think it would be much better if this PR just added support for LeftOuter 
(and maybe RightOuter too?) to HashJoin.
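
A minimal sketch of what LeftOuter support in a hash-based join could look like, using plain Scala collections instead of the actual `HashJoin` operator. The function name, the key extractors, and the use of `Option` for the missing side are assumptions for illustration only, and SQL null-key handling is left out.

```scala
object LeftOuterHashJoinSketch {
  // Build a hash table on the right side, then stream the left side through it.
  def leftOuterJoin[K, L, R](
      left: Seq[L],
      right: Seq[R])(leftKey: L => K, rightKey: R => K): Seq[(L, Option[R])] = {
    val hashTable: Map[K, Seq[R]] = right.groupBy(rightKey)
    left.flatMap { l =>
      hashTable.get(leftKey(l)) match {
        // Matched: emit one output row per matching right row.
        case Some(matches) => matches.map(r => (l, Some(r): Option[R]))
        // Unmatched: keep the left row and pad the right side with None.
        case None => Seq((l, None: Option[R]))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val users = Seq((1, "ann"), (2, "bob"))
    val orders = Seq((1, "book"), (1, "pen"))
    leftOuterJoin(users, orders)(_._1, _._1).foreach(println)
  }
}
```

A RightOuter variant could, under the same assumptions, be obtained by swapping the build and stream sides.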




[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...

2014-06-21 Thread tmalaska
GitHub user tmalaska opened a pull request:

https://github.com/apache/spark/pull/1168

SPARK-1478.2: Upgrade FlumeInputDStream's FlumeReceiver to support 
FLUME-1915

SPARK-1478.2: Upgrade FlumeInputDStream's FlumeReceiver to support
FLUME-1915

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tmalaska/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1168.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1168


commit 12617e51c6f9fbbcf1b21db2cdcda2f7594b10d1
Author: tmalaska ted.mala...@cloudera.com
Date:   2014-06-21T20:03:58Z

SPARK-1478: Upgrade FlumeInputDStream's Flume...

SPARK-1478: Upgrade FlumeInputDStream's FlumeReceiver to support
FLUME-1915






[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...

2014-06-21 Thread tmalaska
Github user tmalaska commented on the pull request:

https://github.com/apache/spark/pull/566#issuecomment-46763419
  
New pull request: https://github.com/apache/spark/pull/1168




[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...

2014-06-21 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/1168#issuecomment-46763507
  
Jenkins, test this please.




[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1168#issuecomment-46763617
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1168#issuecomment-46763622
  
Merged build started. 




[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1168#issuecomment-46763649
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1168#issuecomment-46763650
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15985/




[GitHub] spark pull request: spark-ec2: quote command line args

2014-06-21 Thread orikremer
GitHub user orikremer opened a pull request:

https://github.com/apache/spark/pull/1169

spark-ec2: quote command line args

Preserves quoted command line args (in case options have spaces in them).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/orikremer/spark quote_cmd_line_args

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1169.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1169


commit 67e2aa1c7f945ff43a5b2b092f5cb25904f92265
Author: Ori Kremer ori.kre...@gmail.com
Date:   2014-06-21T20:28:34Z

quote command line args






[GitHub] spark pull request: spark-ec2: quote command line args

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1169#issuecomment-46763973
  
Can one of the admins verify this patch?




[GitHub] spark pull request: [WIP] [SQL] SPARK-1800 Add broadcast hash join...

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1163#discussion_r14051518
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -243,16 +242,25 @@ object HiveMetastoreTypes extends RegexParsers {
   }
 }
 
+
--- End diff --

Extra spaces.




[GitHub] spark pull request: [SQL] SPARK-1800 Add broadcast hash join opera...

2014-06-21 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/734#issuecomment-46764177
  
Closing in favor of: #1163




[GitHub] spark pull request: [SQL] SPARK-1800 Add broadcast hash join opera...

2014-06-21 Thread marmbrus
Github user marmbrus closed the pull request at:

https://github.com/apache/spark/pull/734




[GitHub] spark pull request: [WIP] [SQL] SPARK-1800 Add broadcast hash join...

2014-06-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1163#discussion_r14051536
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala
 ---
@@ -45,8 +45,8 @@ class Projection(expressions: Seq[Expression]) extends 
(Row => Row) {
  * that schema.
  *
  * In contrast to a normal projection, a MutableProjection reuses the same 
underlying row object
- * each time an input row is added.  This significatly reduces the cost of 
calcuating the
- * projection, but means that it is not safe
+ * each time an input row is added.  This significantly reduces the cost 
of calculating the
+ * projection, but means that it is not safe ...?
--- End diff --

... to hold on to a reference to a `Row` after `next()` has been called on 
the `Iterator` that produced it.  Instead, the user must call `Row.copy()` and 
hold on to the returned `Row` before calling `next()`.
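
To illustrate the hazard described above, here is a small sketch with a hypothetical `MutableRow` class and iterator (not Spark's actual `Row` or `MutableProjection`): the iterator reuses one underlying object, so a retained reference changes after `next()` unless it was copied first.

```scala
object RowReuseDemo {
  // Hypothetical mutable row; stands in for a reused Row object.
  final class MutableRow(var value: Int) {
    def copy(): MutableRow = new MutableRow(value)
  }

  // Iterator that overwrites and returns the same row instance on every call.
  def reusingIterator(values: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    values.iterator.map { v => shared.value = v; shared }
  }

  // Unsafe: every retained reference points at the same mutated object.
  def collectWithoutCopy(values: Seq[Int]): Seq[Int] =
    reusingIterator(values).toList.map(_.value)

  // Safe: copy each row before the iterator advances.
  def collectWithCopy(values: Seq[Int]): Seq[Int] =
    reusingIterator(values).map(_.copy()).toList.map(_.value)

  def main(args: Array[String]): Unit = {
    println(collectWithoutCopy(Seq(1, 2, 3))) // List(3, 3, 3)
    println(collectWithCopy(Seq(1, 2, 3)))    // List(1, 2, 3)
  }
}
```

The uncopied version ends up holding three references to the same object, which is why all retained values show the last row.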




[GitHub] spark pull request: spark-ec2: quote command line args

2014-06-21 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1169#issuecomment-46764293
  
Jenkins, test this please. Thanks, this LGTM.




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46764347
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46764340
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46764649
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46764653
  
Merged build started. 




[GitHub] spark pull request: [WIP] [SQL] SPARK-1800 Add broadcast hash join...

2014-06-21 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1163#issuecomment-46764715
  
Regarding testing, we will probably want to pull all of our various join 
tests out into a separate test suite that can be run with various options 
turned on and off, so we exercise all of the edge cases for each of the join 
operators.  This is going to become more important as we add more and more join 
types, so I think it's worth putting some time into it.

Towards that, we might consider breaking this PR into a few pieces: get the 
new join type / testing in soon, then add the auto selection / cost estimation 
in a follow-up.




[GitHub] spark pull request: spark-ec2: quote command line args

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1169#issuecomment-46765273
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: spark-ec2: quote command line args

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1169#issuecomment-46765274
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15986/




[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...

2014-06-21 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/906#issuecomment-46765270
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46765275
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15987/




[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/906#issuecomment-46765330
  
Merged build started. 




[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/906#issuecomment-46765328
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/906#issuecomment-46765353
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46765537
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15988/




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46765536
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46765859
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46765864
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46765886
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46765887
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15990/




[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...

2014-06-21 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/1170

SPARK-1996. Remove use of special Maven repo for Akka

Just following up on Matei's suggestion to remove the Akka repo references. 
Builds and the audit-release script appear OK.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-1996

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1170.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1170


commit 5ca2930ccb7485a3037fa9bac3a5a4b996385167
Author: Sean Owen so...@cloudera.com
Date:   2014-06-21T22:05:56Z

Remove outdated Akka repository references






[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1170#issuecomment-46766169
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1170#issuecomment-46766174
  
Merged build started. 




[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1170#issuecomment-46766197
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15991/




[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...

2014-06-21 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/1171

SPARK-1675. Make clear whether computePrincipalComponents requires centered 
data

Just closing out this small JIRA, resolving with a comment change.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-1675

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1171.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1171


commit 45ee9b7cccf8ecb25647df5d2deb819caddab26a
Author: Sean Owen so...@cloudera.com
Date:   2014-06-21T22:10:47Z

Add simple note that data need not be centered for 
computePrincipalComponents






[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1170#issuecomment-46766196
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1171#issuecomment-46766279
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1171#issuecomment-46766302
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1171#issuecomment-46766283
  
Merged build started. 




[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1171#issuecomment-46766303
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15992/




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46766392
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46766386
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46766437
  
Jenkins, retest this please.




[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...

2014-06-21 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1170#issuecomment-46766456
  
Jenkins, retest this please.




[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...

2014-06-21 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/906#issuecomment-46766447
  
Jenkins, retest this please.




[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...

2014-06-21 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1171#issuecomment-46766462
  
Jenkins, retest this please.




[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1170#issuecomment-46766505
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1996. Remove use of special Maven repo f...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1170#issuecomment-46766513
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46766508
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/906#issuecomment-46766509
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1171#issuecomment-46766514
  
Merged build started. 




[GitHub] spark pull request: SPARK-1675. Make clear whether computePrincipa...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1171#issuecomment-46766504
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/906#issuecomment-46766516
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1777][WIP] Prevent OOMs from single par...

2014-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-46766515
  
Merged build started. 




[GitHub] spark pull request: [SPARK-2124] Move aggregation into shuffle imp...

2014-06-21 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1064#discussion_r14051999
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/hash/HashShuffleReader.scala ---
@@ -31,10 +31,24 @@ class HashShuffleReader[K, C](
   require(endPartition == startPartition + 1,
     "Hash shuffle currently only supports fetching one partition")
 
+  private val dep = handle.dependency
+
   /** Read the combined key-values for this reduce task */
   override def read(): Iterator[Product2[K, C]] = {
-    BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context,
-      Serializer.getSerializer(handle.dependency.serializer))
+    val iter = BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context,
+      Serializer.getSerializer(dep.serializer))
+
+    if (dep.aggregator.isDefined) {
+      if (dep.mapSideCombine) {
+        dep.aggregator.get.combineCombinersByKey(iter, context)
+      } else {
+        dep.aggregator.get.combineValuesByKey(iter, context)
--- End diff --

So the one problem I see is that the InterruptibleIterator around these 
calls was lost when you moved them here. This is not great because it means 
tasks running these won't be cancelable. Can you add it back? You already have 
a TaskContext as a field of ShuffleReader.
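
To illustrate the request above, here is a minimal, self-contained sketch of wrapping the combined iterator so that a cancelled task stops iterating. It does not use Spark's classes: `SketchTaskContext`, `SketchInterruptibleIterator`, and the simplified `combineValuesByKey` are hypothetical stand-ins invented for this example, and the exception type is an assumption.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for a task context; it only models a cancellation flag.
final class SketchTaskContext {
  private val interruptedFlag = new AtomicBoolean(false)
  def markInterrupted(): Unit = interruptedFlag.set(true)
  def isInterrupted: Boolean = interruptedFlag.get()
}

// Hypothetical stand-in for an interruptible iterator: it checks the
// cancellation flag on every hasNext before delegating.
final class SketchInterruptibleIterator[T](context: SketchTaskContext, delegate: Iterator[T])
    extends Iterator[T] {
  override def hasNext: Boolean = {
    if (context.isInterrupted) throw new InterruptedException("task was cancelled")
    delegate.hasNext
  }
  override def next(): T = delegate.next()
}

object InterruptibleReadSketch {
  // Simplified stand-in for an aggregator call: sums the values of each key.
  def combineValuesByKey(iter: Iterator[(String, Int)]): Iterator[(String, Int)] = {
    val sums = scala.collection.mutable.LinkedHashMap.empty[String, Int]
    iter.foreach { case (k, v) => sums(k) = sums.getOrElse(k, 0) + v }
    sums.iterator
  }

  // The result of the combine call is wrapped, so consumers of read()
  // observe cancellation while they iterate over the combined records.
  def read(context: SketchTaskContext, fetched: Iterator[(String, Int)]): Iterator[(String, Int)] =
    new SketchInterruptibleIterator(context, combineValuesByKey(fetched))

  def main(args: Array[String]): Unit = {
    val ctx = new SketchTaskContext
    val out = read(ctx, Iterator("a" -> 1, "b" -> 2, "a" -> 3))
    println(out.next())
    ctx.markInterrupted() // simulate the task being cancelled
    val cancelled =
      try { out.hasNext; false }
      catch { case _: InterruptedException => true }
    println(s"cancelled = $cancelled")
  }
}
```

In this sketch only the iteration over the combined output is guarded; the combine step itself still consumes its input eagerly.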




[GitHub] spark pull request: [SPARK-2124] Move aggregation into shuffle imp...

2014-06-21 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1064#discussion_r14052024
  
--- Diff: core/src/test/scala/org/apache/spark/ShuffleSuite.scala ---
@@ -78,8 +81,11 @@ class ShuffleSuite extends FunSuite with Matchers with LocalSparkContext {
     }
     // If the Kryo serializer is not used correctly, the shuffle would fail because the
     // default Java serializer cannot handle the non serializable class.
-    val c = new ShuffledRDD[Int, NonJavaSerializableClass, (Int, NonJavaSerializableClass)](
-      b, new HashPartitioner(3)).setSerializer(new KryoSerializer(conf))
+    val c = new ShuffledRDD[Int,
+      NonJavaSerializableClass,
+      NonJavaSerializableClass,
+      (Int, NonJavaSerializableClass)](b, new HashPartitioner(3))
+      .setSerializer(new KryoSerializer(conf))
--- End diff --

Probably should split out the call to setSerializer into a new statement 
instead of chaining it. (Just do `c.setSerializer(...)`.)
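
As a rough illustration of the suggestion, the sketch below contrasts the chained form with a separate `setSerializer` statement. The classes `SketchShuffledRDD`, `SketchSerializer`, and `SketchKryoSerializer` are hypothetical stand-ins written for this example, not Spark's real types; the setter is assumed to return the receiver so that both forms compile.

```scala
// Hypothetical stand-ins; these are not Spark's real classes.
trait SketchSerializer
final class SketchKryoSerializer extends SketchSerializer

final class SketchShuffledRDD[K, V](val numPartitions: Int) {
  private var serializer: Option[SketchSerializer] = None

  // Fluent setter (assumed): stores the serializer and returns this.
  def setSerializer(s: SketchSerializer): SketchShuffledRDD[K, V] = {
    serializer = Some(s)
    this
  }

  def hasSerializer: Boolean = serializer.isDefined
}

object SetSerializerSketch {
  // Chained form: construction and configuration in one long expression.
  def chained(): SketchShuffledRDD[Int, String] =
    new SketchShuffledRDD[Int, String](3).setSerializer(new SketchKryoSerializer)

  // Split form: construct first, then configure in its own statement.
  def split(): SketchShuffledRDD[Int, String] = {
    val c = new SketchShuffledRDD[Int, String](3)
    c.setSerializer(new SketchKryoSerializer)
    c
  }

  def main(args: Array[String]): Unit = {
    println(chained().hasSerializer)
    println(split().hasSerializer)
  }
}
```

Both forms configure the object identically; the split form simply keeps the multi-line type-parameter list from running into the setter call.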




[GitHub] spark pull request: [SPARK-2124] Move aggregation into shuffle imp...

2014-06-21 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1064#discussion_r14052020
  
--- Diff: core/src/test/scala/org/apache/spark/ShuffleSuite.scala ---
@@ -56,8 +56,11 @@ class ShuffleSuite extends FunSuite with Matchers with LocalSparkContext {
     }
     // If the Kryo serializer is not used correctly, the shuffle would fail because the
     // default Java serializer cannot handle the non serializable class.
-    val c = new ShuffledRDD[Int, NonJavaSerializableClass, (Int, NonJavaSerializableClass)](
-      b, new HashPartitioner(NUM_BLOCKS)).setSerializer(new KryoSerializer(conf))
+    val c = new ShuffledRDD[Int,
+      NonJavaSerializableClass,
+      NonJavaSerializableClass,
+      (Int, NonJavaSerializableClass)](b, new HashPartitioner(NUM_BLOCKS))
--- End diff --

Probably should split out the call to setSerializer into a new statement 
instead of chaining it. (Just do `c.setSerializer(...)`.)




[GitHub] spark pull request: [SPARK-2124] Move aggregation into shuffle imp...

2014-06-21 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1064#issuecomment-46767244
  
Hey Saisai, I noticed one thing that got lost in the move, which is the use 
of InterruptibleIterator. We need to bring that back to allow cancellation of 
reduce tasks. Other than that it looks good to me.




[GitHub] spark pull request: SPARK-1478.2: Upgrade FlumeInputDStream's Flum...

2014-06-21 Thread tmalaska
Github user tmalaska commented on the pull request:

https://github.com/apache/spark/pull/1168#issuecomment-46767307
  
Thanks tdas, I missed that one. I just updated it. It should be good now.




[GitHub] spark pull request: [SPARK-1112, 2156] (1.0 edition) Use correct a...

2014-06-21 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1172#issuecomment-46767349
  
@mengxr - do you mind reviewing this?



