[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-64971765 I don't understand the architecture here as well as @rxin but this change seems like a strict improvement in its current form, so I'm gonna pull it in. LGTM. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3378 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-64303243 @rxin Is it OK to merge? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-63774979 We should definitely add a ClassTag since this can be used for primitive types. However, there might be places where we create a lot of CompactBuffers. I haven't had a chance to look at where CompactBuffers are used yet, but for those places, would it be possible to create a single ClassTag reference? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-63775682 Cogroup uses `CompactBuffer`. However, it cannot add ClassTag due to its signature: ```Scala class CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ : Product2[K, _]]], part: Partitioner) extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil) ``` Here `rdds` is `Seq[RDD[_ : Product2[K, _]]]` without the real template type of `RDD`s --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/3378 [SPARK-4505][Core] Add a ClassTag parameter to CompactBuffer[T] Added a ClassTag parameter to CompactBuffer. So CompactBuffer[T] can create primitive arrays for primitive types. It will reduce the memory usage for primitive types significantly and only pay minor performance lost. Here is my test code: ```Scala // Call org.apache.spark.util.SizeEstimator.estimate def estimateSize(obj: AnyRef): Long = { val c = Class.forName(org.apache.spark.util.SizeEstimator$) val f = c.getField(MODULE$) val o = f.get(c) val m = c.getMethod(estimate, classOf[Object]) m.setAccessible(true) m.invoke(o, obj).asInstanceOf[Long] } sc.parallelize(1 to 1).groupBy(_ = 1).foreach { case (k, v) = println(v.getClass() + size: + estimateSize(v)) } ``` Using the previous CompactBuffer outputed ``` class org.apache.spark.util.collection.CompactBuffer size: 313358 ``` Using the new CompactBuffer outputed ``` class org.apache.spark.util.collection.CompactBuffer size: 65712 ``` In this case, the new `CompactBuffer` only used 20% memory of the previous one. It's really helpful for `groupByKey` when using a primitive value. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-4505 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3378.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3378 commit 4abdbba566aa776280cc20689fa901e17634e5fe Author: zsxwing zsxw...@gmail.com Date: 2014-11-20T01:35:27Z Add a ClassTag parameter to reduce the memory usage of CompactBuffer[T] when T is a primitive type --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-63765080 [Test build #23661 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23661/consoleFull) for PR 3378 at commit [`4abdbba`](https://github.com/apache/spark/commit/4abdbba566aa776280cc20689fa901e17634e5fe). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-63765566 This seems like probably a great idea. Do you know what the overhead of including a classtag is? Does it mean an extra pointer per object? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-63766519 Does it mean an extra pointer per object? No. E.g., ClassTag.Int will be shared by all CompactBuffer[Int]. Same approach has already bean used in RDD. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-63769414 It's weird. I just found both the sizes of old and new `CompactBuffer(1)` are 56. I cannot explain why. Then I added a field to the old CompactBuffer like this: ```Scala class CompactBuffer[T] extends Seq[T] with Serializable { val dummy: AnyRef = null // First two elements private var element0: T = _ private var element1: T = _ ``` `println(estimateSize(CompactBuffer[Int](1)))` also outputs `56`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-63769839 This does seem like a good change, though I'll note that I think groupBy is the only current user of this API that is able to have a primitive ClassTag. Still worthwhile, especially for future usage. I do wonder if it could have a runtime impact due to increased primitive wrapping, possibly creating a lot of short-lived garbage if it were iterated over many times. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-63770652 [Test build #23661 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23661/consoleFull) for PR 3378 at commit [`4abdbba`](https://github.com/apache/spark/commit/4abdbba566aa776280cc20689fa901e17634e5fe). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-63770658 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23661/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-63770746 It's weird. I just found both the sizes of old and new CompactBuffer(1) are 56. Found the cause. My JVM enables `UseCompressedOops`. So in such case, due to alignment, the sizes are same. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4505][Core] Add a ClassTag parameter to...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/3378#issuecomment-63772021 My motivation is that we encountered a skew data set that a special hot key has too many values and could not fit into memory. Spilling helps nothing in this case since groupBy will put all values of a key into a CompactBuffer. After this optimization, at least, my job could run using the same memory limitation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org