Re: subtractByKey increases RDD size in memory - any ideas?

DaPsul Fri, 19 Feb 2016 02:01:07 -0800

That could be the reason, but if you extract the data and create a new RDD from it, the size is still larger:


val data = rdd3.collect()


val rdd4 = sc.parallelize(data)
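
One way to compare the cached sizes without reading them off the web UI is sc.getRDDStorageInfo, a developer API on SparkContext. The following is a minimal sketch, not a diagnosis: the object name CachedSizes, the helper cachedSizes, the RDD names and the local master setting are assumptions made for illustration, and on newer Spark versions the reported numbers may lag slightly behind the actions that fill the cache.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedSizes {
  // Hypothetical helper: rebuilds the RDDs from this thread, caches them,
  // and returns the in-memory size Spark reports for each, keyed by RDD name.
  def cachedSizes(sc: SparkContext): Map[String, Long] = {
    val rdd1 = sc.parallelize(Array((1, 1), (2, 2), (3, 3))).setName("rdd1").cache()
    val rdd2 = sc.parallelize(Array[(Int, Int)]())
    val rdd3 = rdd1.subtractByKey(rdd2).setName("rdd3").cache()
    val rdd4 = sc.parallelize(rdd3.collect()).setName("rdd4").cache()

    // count() touches every partition, so each RDD is cached completely
    Seq(rdd1, rdd3, rdd4).foreach(_.count())

    // One entry per cached RDD that Spark currently reports
    sc.getRDDStorageInfo.map(info => info.name -> info.memSize).toMap
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption; in spark-shell, reuse the existing `sc`
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("cached-sizes"))
    try {
      cachedSizes(sc).toSeq.sortBy(_._1).foreach { case (name, bytes) =>
        println(s"$name: $bytes bytes in memory")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Using count() instead of first matters here: first may compute only one partition, so the cached size it produces is not necessarily the size of the whole RDD.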



On 19/02/16 at 02:32, Andrew Ehrlich wrote:

There could be clues in the different RDD subclasses; rdd1 is a ParallelCollectionRDD, but rdd3 is a SubtractedRDD.
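
The subclass and the lineage of each RDD can be printed directly. This is a small sketch under assumptions: the object name InspectLineage, the helper describe and the local master setting are made up for illustration; getClass and toDebugString are the only calls doing the actual work.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object InspectLineage {
  // Hypothetical helper: concrete RDD subclass on the first line,
  // followed by the lineage as Spark renders it
  def describe(rdd: RDD[_]): String =
    s"${rdd.getClass.getSimpleName}\n${rdd.toDebugString}"

  def main(args: Array[String]): Unit = {
    // Local master is an assumption; in spark-shell, reuse the existing `sc`
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("inspect-lineage"))
    try {
      val rdd1 = sc.parallelize(Array((1, 1), (2, 2), (3, 3)))
      val rdd2 = sc.parallelize(Array[(Int, Int)]())
      val rdd3 = rdd1.subtractByKey(rdd2)

      println(describe(rdd1))
      println(describe(rdd3))

      // subtractByKey assigns a partitioner; parallelize does not
      println(s"rdd1 partitioner: ${rdd1.partitioner}")
      println(s"rdd3 partitioner: ${rdd3.partitioner}")
    } finally {
      sc.stop()
    }
  }
}
```

The output shows how rdd3 differs structurally from rdd1 (a shuffle dependency and a partitioner). Whether that accounts for the larger cached size is not established in this thread.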

On Thu, Feb 18, 2016 at 1:37 PM, DaPsul <dap...@gmx.de> wrote:


    (copied from
    http://stackoverflow.com/questions/35467128/spark-subtractbykey-increases-rdd-cached-memory-size)

    I've found some very strange behavior for RDDs (Spark 1.6.0 with
    Scala 2.11):

    When I use subtractByKey on an RDD, the resulting RDD should be of
    equal or smaller size. What I get instead is an RDD that takes up
    more space in memory:

    //Initialize first RDD
    val rdd1 = sc.parallelize(Array((1,1),(2,2),(3,3))).cache()

    //dummy action to cache it => size according to webgui: 184 Bytes
    rdd1.first

    //Initialize RDD to subtract (empty RDD should result in no change
    //for rdd1)
    val rdd2 = sc.parallelize(Array[(Int,Int)]())

    //perform subtraction
    val rdd3 = rdd1.subtractByKey(rdd2).cache()

    //dummy action to cache rdd3 => size according to webgui: 208 Bytes
    rdd3.first

    I first noticed this strange behaviour with an RDD of ~200k rows
    and a size of 1.3 GB, which grew to more than 2 GB after the
    subtraction.

    Edit: I tried the example above with more values (10k) and saw the
    same behaviour. The size increases by a factor of ~1.6. reduceByKey
    also seems to have a similar effect.

    When I create an RDD with

    sc.parallelize(rdd3.collect())

    the size is the same as for rdd3, so the increased size carries
    over even when the data is extracted from the RDD.





