It's likely the Ints are getting boxed at some point along the journey (perhaps starting with parallelize()). I could definitely see boxed Ints being 7 times larger than primitive ones: on a typical 64-bit JVM, each java.lang.Integer is a roughly 16-byte object reached through a 4- or 8-byte reference, versus 4 bytes for a primitive Int, so 6-7x is plausible before any other overhead.
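
If you want to see the boxing cost directly, here's a minimal sketch using Spark's SizeEstimator (the same estimator behind those "estimated size" log lines). One caveat: SizeEstimator is a Spark-internal utility and is private[spark] in older releases, so treat this as illustrative rather than guaranteed API:

  import org.apache.spark.util.SizeEstimator

  val size = 1024 * 1024
  val prim = new Array[Int](size)                // flat primitives: ~4 MB
  val boxed = prim.map(i => Integer.valueOf(i))  // Array[java.lang.Integer]: one object per element

  println(SizeEstimator.estimate(prim))   // roughly 4 MB
  println(SizeEstimator.estimate(boxed))  // several times larger: references plus object headers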
If you wanted to be very careful, you could try making an RDD[Array[Int]], where each element is simply a chunk of your original array, with one partition per element, effectively partitioning your data by hand. I suspect you'd see the 7x overhead disappear; there's a sketch after the quoted message below.

On Mon, Apr 14, 2014 at 7:07 PM, wxhsdp <wxh...@gmail.com> wrote:

> Hi all,
>
> In order to understand memory usage in Spark, I ran the following test:
>
> val size = 1024*1024
> val array = new Array[Int](size)
>
> for(i <- 0 until size) {
>   array(i) = i
> }
>
> val a = sc.parallelize(array).cache() /*4MB*/
>
> val b = a.mapPartitions{ c => {
>   val d = c.toArray
>
>   val e = new Array[Int](2*size) /*8MB*/
>   val f = new Array[Int](2*size) /*8MB*/
>
>   for(i <- 0 until 2*size) {
>     e(i) = d(i % size)
>     f(i) = d((i+1) % size)
>   }
>
>   (e++f).toIterator
> }}.cache()
>
> When I compile and run with sbt, the estimated sizes of a and b are
> exactly 7 times larger than the real sizes:
>
> 14/04/15 09:10:55 INFO storage.MemoryStore: Block rdd_0_0 stored as values
> to memory (estimated size 28.0 MB, free 862.9 MB)
> 14/04/15 09:10:55 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
> Added rdd_0_0 in memory on ubuntu.local:59962 (size: 28.0 MB, free: 862.9
> MB)
>
> 14/04/15 09:10:56 INFO storage.MemoryStore: Block rdd_1_0 stored as values
> to memory (estimated size 112.0 MB, free 750.9 MB)
> 14/04/15 09:10:56 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
> Added rdd_1_0 in memory on ubuntu.local:59962 (size: 112.0 MB, free: 750.9
> MB)
>
> But when I try it in the Spark shell, the estimated size is almost equal
> to the real size:
>
> 14/04/15 09:23:27 INFO MemoryStore: Block rdd_0_0 stored as values to
> memory (estimated size 4.2 MB, free 292.7 MB)
> 14/04/15 09:23:27 INFO BlockManagerMasterActor$BlockManagerInfo: Added
> rdd_0_0 in memory on ubuntu.local:54071 (size: 4.2 MB, free: 292.7 MB)
>
> 14/04/15 09:27:40 INFO MemoryStore: Block rdd_1_0 stored as values to
> memory (estimated size 17.0 MB, free 275.8 MB)
> 14/04/15 09:27:40 INFO BlockManagerMasterActor$BlockManagerInfo: Added
> rdd_1_0 in memory on ubuntu.local:54071 (size: 17.0 MB, free: 275.8 MB)
>
> Does anyone know the reason? I'm really confused about memory use in
> Spark.
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n4251/memory.png>
>
> JVM memory and Spark memory sit in different parts of system memory; Spark
> code executes in JVM memory, and an allocation like val e = new
> Array[Int](2*size) /*8MB*/ uses JVM memory. If an RDD is not cached, it is
> written back to disk; if cached, it is copied to Spark memory. Is that
> right?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/storage-MemoryStore-estimated-size-7-times-larger-than-real-tp4251.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
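
To make the suggestion above concrete, here's an untested sketch (the chunk count of 8 is arbitrary): keep each RDD element a primitive Array[Int], one chunk per partition, so nothing gets boxed per element:

  val size = 1024 * 1024
  val array = Array.tabulate(size)(i => i)

  // Split the data into one primitive Array[Int] per partition instead of
  // letting parallelize() box every Int individually.
  val numChunks = 8
  val chunks = array.grouped(size / numChunks).toArray  // Array[Array[Int]]

  val a = sc.parallelize(chunks, numChunks).cache()     // RDD[Array[Int]]

  // Downstream code then operates on whole chunks, e.g.:
  val sums = a.map(chunk => chunk.sum).collect()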