It's likely the Ints are getting boxed at some point along the journey
(perhaps starting with parallelize()). Boxed Ints being 7 times larger than
primitive ones is entirely plausible: on a 64-bit JVM, each boxed
java.lang.Integer occupies roughly 16-24 bytes of heap plus a 4-8 byte
reference in the array, versus just 4 bytes for a primitive Int.
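If you want to check the boxing hypothesis directly, Spark's SizeEstimator
can measure the deep size of both representations (it's public in recent
releases, though private[spark] in some early ones). A minimal sketch from
the shell; the exact numbers depend on your JVM and flags:

import org.apache.spark.util.SizeEstimator

val primitives = new Array[Int](1024 * 1024)  // 1M primitive Ints, ~4 MB raw
val boxed = primitives.map(Int.box)           // Array[java.lang.Integer]

// Deep sizes in bytes; expect ~4 MB vs. something in the 20-32 MB range.
println(SizeEstimator.estimate(primitives))
println(SizeEstimator.estimate(boxed))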

If you wanted to be very careful, you could try making an RDD[Array[Int]],
where each element is simply a subset of your original array, and
specifying one partition per element, effectively manually partitioning
your data. I suspect you'd see the 7x overhead disappear.
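Here's a minimal sketch of that manual partitioning; numSlices and the
chunking via grouped() are just illustrative choices:

// Same data as in your test, rebuilt locally for a self-contained example.
val size = 1024 * 1024
val array = Array.tabulate(size)(identity)

// One primitive int[] chunk per desired partition.
val numSlices = 4
val chunks = array.grouped(size / numSlices).toSeq

// RDD[Array[Int]] with one partition per chunk; the elements stay
// unboxed int[]s, so the cached size should be close to the raw 4 MB.
val rdd = sc.parallelize(chunks, chunks.length).cache()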


On Mon, Apr 14, 2014 at 7:07 PM, wxhsdp <wxh...@gmail.com> wrote:

> Hi, all,
> In order to understand memory usage in Spark, I did the following test:
>
> val size = 1024*1024
> val array = new Array[Int](size)
>
> for (i <- 0 until size) {
>   array(i) = i
> }
>
> val a = sc.parallelize(array).cache() /*4MB*/
>
> val b = a.mapPartitions{ c => {
>   val d = c.toArray
>
>   val e = new Array[Int](2*size) /*8MB*/
>   val f = new Array[Int](2*size) /*8MB*/
>
>   for(i <- 0 until 2*size) {
>     e(i) = d(i % size)
>     f(i) = d((i+1) % size)
>   }
>
>   (e++f).toIterator
> }}.cache()
>
> When I compile and run with sbt, the estimated sizes of a and b are exactly
> 7 times the real sizes:
>
> 14/04/15 09:10:55 INFO storage.MemoryStore: Block rdd_0_0 stored as values
> to memory (estimated size 28.0 MB, free 862.9 MB)
> 14/04/15 09:10:55 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
> Added rdd_0_0 in memory on ubuntu.local:59962 (size: 28.0 MB, free: 862.9
> MB)
>
> 14/04/15 09:10:56 INFO storage.MemoryStore: Block rdd_1_0 stored as values
> to memory (estimated size 112.0 MB, free 750.9 MB)
> 14/04/15 09:10:56 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
> Added rdd_1_0 in memory on ubuntu.local:59962 (size: 112.0 MB, free: 750.9
> MB)
>
> But when I try it in the Spark shell, the estimated size is almost equal
> to the real size:
>
> 14/04/15 09:23:27 INFO MemoryStore: Block rdd_0_0 stored as values to
> memory
> (estimated size 4.2 MB, free 292.7 MB)
> 14/04/15 09:23:27 INFO BlockManagerMasterActor$BlockManagerInfo: Added
> rdd_0_0 in memory on ubuntu.local:54071 (size: 4.2 MB, free: 292.7 MB)
>
> 14/04/15 09:27:40 INFO MemoryStore: Block rdd_1_0 stored as values to
> memory
> (estimated size 17.0 MB, free 275.8 MB)
> 14/04/15 09:27:40 INFO BlockManagerMasterActor$BlockManagerInfo: Added
> rdd_1_0 in memory on ubuntu.local:54071 (size: 17.0 MB, free: 275.8 MB)
>
> Who knows the reason? I'm really confused about memory use in Spark.
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n4251/memory.png>
>
> JVM memory and Spark memory are located in different parts of system
> memory; Spark code executes in JVM memory, and allocations like
> val e = new Array[Int](2*size) /*8MB*/ use JVM memory. If not cached,
> generated RDDs are written back to disk; if cached, RDDs are copied to
> Spark memory. Is that right?
>
