[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1450#issuecomment-49261722
  
Eh, the binary checker is really failing me. Is there a way to disable the
binary checker for inner functions? @pwendell


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1450#discussion_r15043414
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   throw new SparkException("reduceByKeyLocally() does not support array keys")
 }
 
-def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
+val reducePartition = (iter: Iterator[(K, V)]) => {
--- End diff --

this is fixed




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1450#issuecomment-49266828
  
QA tests have started for PR 1450. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16770/consoleFull




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1450#issuecomment-49269144
  
I created a JIRA to deal with this and did some initial exploration, but I 
think I'll need to wait for Prashant to actually do it:

https://issues.apache.org/jira/browse/SPARK-2549




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1450#issuecomment-49350307
  
Merged in master.




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1450




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1469

[SPARK-2534] Avoid pulling in the entire RDD in various operators 
(branch-1.0 backport)

This backports #1450 into branch-1.0.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark closure-1.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1469.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1469


commit b474a92d0cf051c6dc67ddfcc7423427ccd69020
Author: Reynold Xin r...@apache.org
Date:   2014-07-17T19:25:56Z

[SPARK-2534] Avoid pulling in the entire RDD in various operators






[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1469#issuecomment-49353170
  
QA tests have started for PR 1469. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16787/consoleFull




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1469#issuecomment-49373136
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1469#issuecomment-49373309
  
QA tests have started for PR 1469. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16791/consoleFull




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1469#issuecomment-49376423
  
QA results for PR 1469:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16791/consoleFull




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1469#issuecomment-49379863
  
Merging in master.




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread rxin
Github user rxin closed the pull request at:

https://github.com/apache/spark/pull/1469




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-16 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1450

[SPARK-2534] Avoid pulling in the entire RDD in groupByKey.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark agg-closure

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1450.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1450


commit 73b2783fef785941fc966ad32f2fd987b12447ae
Author: Reynold Xin r...@apache.org
Date:   2014-07-16T23:34:34Z

[SPARK-2534] Avoid pulling in the entire RDD in groupByKey.






[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1450#issuecomment-49242499
  
Jenkins, why are you so slow?




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-16 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1450#discussion_r15038170
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -361,11 +361,11 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 // groupByKey shouldn't use map side combine because map side combine does not
 // reduce the amount of data shuffled and requires all map side data be inserted
 // into a hash table, leading to more objects in the old gen.
-def createCombiner(v: V) = ArrayBuffer(v)
--- End diff --

There appear to be ~6 other functions of this type (defs that may be passed
into closures). Could these also be problematic?
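
For context, a minimal sketch of the two forms under discussion. `Grouper` and `payload` are hypothetical stand-ins (not Spark classes) for `PairRDDFunctions` and the RDD it wraps. The concern, as I understand it, is that a local `def` passed as a function may end up referencing the enclosing instance on some Scala compiler versions, while a function `val` that uses no instance members has no need to. Whether capture actually happens depends on the compiler, so the sketch only shows that the two forms are interchangeable at the call site.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for PairRDDFunctions; `payload` plays the role of the wrapped RDD.
class Grouper[V](val payload: Seq[V]) {

  // Form before the change: a local def, converted to a function value on return.
  // On some compiler versions the resulting closure may reference `this`.
  def combinerFromDef: V => ArrayBuffer[V] = {
    def createCombiner(v: V) = ArrayBuffer(v)
    createCombiner
  }

  // Form after the change: a function value that touches no member of `this`.
  def combinerFromVal: V => ArrayBuffer[V] = {
    val createCombiner = (v: V) => ArrayBuffer(v)
    createCombiner
  }
}

object ClosureSketch {
  def main(args: Array[String]): Unit = {
    val g = new Grouper(Seq(1, 2, 3))
    // Both forms behave identically when applied.
    println(g.combinerFromDef(7) == g.combinerFromVal(7))
  }
}
```

Serializing each returned function with `java.io.ObjectOutputStream` and comparing the byte counts is one way to check whether the enclosing object is being dragged along on a given compiler.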




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-16 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1450#discussion_r15038311
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -361,11 +361,11 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 // groupByKey shouldn't use map side combine because map side combine 
does not
 // reduce the amount of data shuffled and requires all map side data 
be inserted
 // into a hash table, leading to more objects in the old gen.
-def createCombiner(v: V) = ArrayBuffer(v)
--- End diff --

We should change all of them actually. I will update the PR.




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1450#issuecomment-49251632
  
Pushed a new version.




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1450#issuecomment-49251845
  
QA tests have started for PR 1450. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16762/consoleFull




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1450#issuecomment-49256041
  
QA results for PR 1450:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16762/consoleFull




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-16 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1450#discussion_r15040306
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   throw new SparkException("reduceByKeyLocally() does not support array keys")
 }
 
-def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
+val reducePartition = (iter: Iterator[(K, V)]) => {
--- End diff --

I have to push back on the loss of the return type here, since I don't
think it's obvious. I know it's kind of a pain to add the whole type
specification, though... what would you think about putting a
`: Iterator[JHashMap[K, V]]` after the final bracket?
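
A rough sketch of what that suggestion could look like, assuming the function body builds one hash map per partition as the return type implies. `ReducePartitionSketch`, `makeReducer`, and `func` are hypothetical names introduced for illustration; this is not the code from the PR.

```scala
import java.util.{HashMap => JHashMap}

object ReducePartitionSketch {
  // Hypothetical helper: builds a per-partition reducer from a merge function `func`.
  def makeReducer[K, V](func: (V, V) => V): Iterator[(K, V)] => Iterator[JHashMap[K, V]] = {
    val reducePartition = (iter: Iterator[(K, V)]) => {
      val map = new JHashMap[K, V]
      iter.foreach { case (k, v) =>
        // Merge with the existing value when the key was already seen.
        map.put(k, if (map.containsKey(k)) func(map.get(k), v) else v)
      }
      Iterator(map)
    } : Iterator[JHashMap[K, V]] // type ascription after the closing brace
    reducePartition
  }

  def main(args: Array[String]): Unit = {
    val reduce = makeReducer[String, Int](_ + _)
    val result = reduce(Iterator("a" -> 1, "b" -> 2, "a" -> 3)).next()
    println(result)
  }
}
```

The ascription applies to the body of the function literal, so a reader sees the result type without the full `Function1` signature being spelled out on the `val`.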




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-16 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1450#discussion_r15040327
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   throw new SparkException("reduceByKeyLocally() does not support array keys")
 }
 
-def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
+val reducePartition = (iter: Iterator[(K, V)]) => {
--- End diff --

And when I said non-obvious, I meant just from looking at the function name
and input arguments. Here it is actually straightforward to infer from the
remaining lines, but in other situations it is less so.




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-16 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1450#discussion_r15040336
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   throw new SparkException("reduceByKeyLocally() does not support array keys")
 }
 
-def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
+val reducePartition = (iter: Iterator[(K, V)]) => {
--- End diff --

That makes sense.




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1450#issuecomment-49256830
  
QA tests have started for PR 1450. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16765/consoleFull




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1450#issuecomment-49261240
  
QA results for PR 1450:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16765/consoleFull

