[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-64971765
  
I don't understand the architecture here as well as @rxin but this change 
seems like a strict improvement in its current form, so I'm gonna pull it in. 
LGTM.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3378


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-24 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-64303243
  
@rxin Is it OK to merge? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-20 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-63774979
  
We should definitely add a ClassTag since this can be used for primitive 
types. However, there might be places where we create a lot of CompactBuffers. 
I haven't had a chance to look at where CompactBuffers are used yet, but for 
those places, would it be possible to create a single ClassTag reference? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-20 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-63775682
  
Cogroup uses `CompactBuffer`. However, it cannot add ClassTag due to its 
signature:

```Scala
class CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ : Product2[K, _]]], 
part: Partitioner)
  extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil)
```
Here `rdds` is `Seq[RDD[_ : Product2[K, _]]]` without the real template 
type of `RDD`s


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-19 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/3378

[SPARK-4505][Core] Add a ClassTag parameter to CompactBuffer[T]

Added a ClassTag parameter to CompactBuffer. So CompactBuffer[T] can create 
primitive arrays for primitive types. It will reduce the memory usage for 
primitive types significantly and only pay minor performance lost.

Here is my test code:
```Scala
  // Call org.apache.spark.util.SizeEstimator.estimate
  def estimateSize(obj: AnyRef): Long = {
val c = Class.forName(org.apache.spark.util.SizeEstimator$)
val f = c.getField(MODULE$)
val o = f.get(c)
val m = c.getMethod(estimate, classOf[Object])
m.setAccessible(true)
m.invoke(o, obj).asInstanceOf[Long]
  }

  sc.parallelize(1 to 1).groupBy(_ = 1).foreach {
case (k, v) =
  println(v.getClass() +  size:  + estimateSize(v))
  }
```

Using the previous CompactBuffer outputed
```
class org.apache.spark.util.collection.CompactBuffer size: 313358
```

Using the new CompactBuffer outputed
```
class org.apache.spark.util.collection.CompactBuffer size: 65712
```

In this case, the new `CompactBuffer` only used 20% memory of the previous 
one. It's really helpful for `groupByKey` when using a primitive value.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-4505

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3378.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3378


commit 4abdbba566aa776280cc20689fa901e17634e5fe
Author: zsxwing zsxw...@gmail.com
Date:   2014-11-20T01:35:27Z

Add a ClassTag parameter to reduce the memory usage of CompactBuffer[T] 
when T is a primitive type




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-63765080
  
  [Test build #23661 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23661/consoleFull)
 for   PR 3378 at commit 
[`4abdbba`](https://github.com/apache/spark/commit/4abdbba566aa776280cc20689fa901e17634e5fe).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-19 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-63765566
  
This seems like probably a great idea.  Do you know what the overhead of 
including a classtag is?  Does it mean an extra pointer per object?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-19 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-63766519
  
 Does it mean an extra pointer per object?

No. E.g., ClassTag.Int will be shared by all CompactBuffer[Int]. Same 
approach has already bean used in RDD.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-19 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-63769414
  
It's weird. I just found both the sizes of old and new `CompactBuffer(1)` 
are 56. I cannot explain why.

Then I added a field to the old CompactBuffer like this:
```Scala
class CompactBuffer[T] extends Seq[T] with Serializable {
  val dummy: AnyRef = null

  // First two elements
  private var element0: T = _
  private var element1: T = _
```
`println(estimateSize(CompactBuffer[Int](1)))` also outputs `56`.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-19 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-63769839
  
This does seem like a good change, though I'll note that I think groupBy is 
the only current user of this API that is able to have a primitive ClassTag. 
Still worthwhile, especially for future usage. I do wonder if it could have a 
runtime impact due to increased primitive wrapping, possibly creating a lot of 
short-lived garbage if it were iterated over many times.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-63770652
  
  [Test build #23661 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23661/consoleFull)
 for   PR 3378 at commit 
[`4abdbba`](https://github.com/apache/spark/commit/4abdbba566aa776280cc20689fa901e17634e5fe).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-63770658
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23661/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-19 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-63770746
  
 It's weird. I just found both the sizes of old and new CompactBuffer(1) 
are 56.

Found the cause. My JVM enables `UseCompressedOops`. So in such case, due 
to alignment, the sizes are same.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...

2014-11-19 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/3378#issuecomment-63772021
  
My motivation is that we encountered a skew data set that a special hot key 
has too many values and could not fit into memory. Spilling helps nothing in 
this case since groupBy will put all values of a key into a CompactBuffer. 
After this optimization, at least, my job could run using the same memory 
limitation.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org