Hi,
It would be whatever's left in the JVM heap. That portion is not explicitly
controlled by a fraction the way storage or shuffle memory is. In practice,
though, the computation itself usually doesn't need that much space; in my
experience it's almost always caching or aggregation during shuffles that is
the most memory intensive.
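To make the arithmetic concrete, here is a minimal sketch of the pre-unified (Spark 1.x) split, using the default fractions from your question and an assumed 4 GB executor heap; the safety factors you set aside are omitted here as well:

```python
# Hypothetical illustration of the Spark 1.x executor-memory split.
# The heap size is an assumed example value; the fractions are the
# documented defaults for spark.storage.memoryFraction and
# spark.shuffle.memoryFraction.
executor_heap_mb = 4096                 # assumed executor -Xmx

storage_fraction = 0.6                  # spark.storage.memoryFraction default
shuffle_fraction = 0.2                  # spark.shuffle.memoryFraction default

storage_mb = executor_heap_mb * storage_fraction
shuffle_mb = executor_heap_mb * shuffle_fraction
# Whatever is left over is what user code and per-partition
# computation (flatMap, map, etc.) run in.
remaining_mb = executor_heap_mb - storage_mb - shuffle_mb

print(int(storage_mb), int(shuffle_mb), int(remaining_mb))  # → 2457 819 819
```

So with the defaults, roughly a fifth of the heap is left implicitly for computation, which is why heavy caching or large shuffle aggregations, not the transformations themselves, are usually what exhausts memory.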
-Andrew
2015-07-21 13:47 GMT-07:00 wdbaruni :
> I am new to Spark and I understand that Spark divides the executor memory
> into the following fractions:
>
> *RDD Storage:* Which Spark uses to store persisted RDDs via .persist() or
> .cache(), and which can be sized by setting spark.storage.memoryFraction
> (default 0.6)
>
> *Shuffle and aggregation buffers:* Which Spark uses to store shuffle
> outputs. It can be defined using spark.shuffle.memoryFraction (default
> 0.2). If shuffle output exceeds this fraction, Spark will spill data to
> disk.
>
> *User code:* Spark uses this fraction to execute arbitrary user code
> (default 0.2)
>
> I am not mentioning the storage and shuffle safety fractions for
> simplicity.
>
> My question is, which memory fraction is Spark using to compute and
> transform RDDs that are not going to be persisted? For example:
>
> from operator import add
>
> lines = sc.textFile("i am a big file.txt")
> count = lines.flatMap(lambda x: x.split(' ')) \
>              .map(lambda x: (x, 1)) \
>              .reduceByKey(add)
> count.saveAsTextFile("output")
>
> Here Spark will not load the whole file at once; it will partition the
> input file and apply all these transformations per partition in a single
> stage. However, which memory fraction will Spark use to load each
> partition's lines and compute flatMap() and map()?
>
> Thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Which-memory-fraction-is-Spark-using-to-compute-RDDs-that-are-not-going-to-be-persisted-tp23942.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.