vanzin opened a new pull request #25779: [SPARK-27468][core] Track correct storage level and mem/disk usage for RDDs.
URL: https://github.com/apache/spark/pull/25779
 
 
   Two things are being fixed here. The first, explicitly explained in the
   referenced bug, is the storage level tracked for RDDs and partitions.
   
   Previously, the RDD level would change depending on the status reported
   by executors for the block they were storing, and individual blocks would
   reflect that. That is wrong because different blocks may be stored
   differently in different executors.
   
   So now the RDD tracks the user-provided storage level, while the individual
   partitions reflect the current storage level of that particular block,
   including the current number of replicas.
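
   As an illustration of that split, here is a minimal sketch in Python
   (not the actual Spark code). The names `StorageLevel`, `LiveRDD`,
   `requested_level`, `partition_levels` and `on_block_update` are
   hypothetical stand-ins, and carrying the replica count directly in the
   reported level is a simplification assumed for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class StorageLevel:
    """Hypothetical stand-in for a storage level description."""
    use_memory: bool
    use_disk: bool
    replication: int = 1


@dataclass
class LiveRDD:
    """Hypothetical driver-side record for one cached RDD."""
    # Set once from what the user requested; block updates never change it.
    requested_level: StorageLevel
    # Partition index -> level last reported for that particular block.
    partition_levels: Dict[int, StorageLevel] = field(default_factory=dict)

    def on_block_update(self, partition: int, reported: StorageLevel) -> None:
        # Only the partition entry follows the executor's report.
        self.partition_levels[partition] = reported


rdd = LiveRDD(requested_level=StorageLevel(True, True, replication=2))
# Assumed scenario: partition 0 ended up on disk only, partition 1 in memory.
rdd.on_block_update(0, StorageLevel(use_memory=False, use_disk=True))
rdd.on_block_update(1, StorageLevel(use_memory=True, use_disk=False, replication=2))
```

   In this sketch the RDD-level value stays at what the user asked for,
   while each partition entry can differ from it and from the others.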
   
   The second fix is in the accounting of usage: block managers report the
   current mem and disk used by the block, not the change from before. So
   the status listener needs to track the previous usage of the blocks so
   that it can accurately calculate the changes. This requires a bit more
   memory to be used in the driver, but tests show it's not that big of a
   problem (a few MB for a 100k-partition RDD with all blocks cached).
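
   The accounting idea can be sketched as follows, in Python rather than
   the actual Spark code. `UsageTracker`, its fields, and the block names
   are hypothetical; dropping the entry once a block reports zero usage is
   also an assumption made for the example.

```python
from typing import Dict, Tuple


class UsageTracker:
    """Hypothetical listener that turns absolute block reports into totals."""

    def __init__(self) -> None:
        # Block name -> (memory bytes, disk bytes) from the last report.
        self._last: Dict[str, Tuple[int, int]] = {}
        self.memory_used = 0
        self.disk_used = 0

    def on_block_update(self, block: str, mem: int, disk: int) -> None:
        prev_mem, prev_disk = self._last.get(block, (0, 0))
        # Reports are absolute, so apply only the difference to the totals.
        self.memory_used += mem - prev_mem
        self.disk_used += disk - prev_disk
        if mem == 0 and disk == 0:
            # Assumed cleanup: the block is no longer stored, drop its entry.
            self._last.pop(block, None)
        else:
            self._last[block] = (mem, disk)


tracker = UsageTracker()
tracker.on_block_update("rdd_1_0", mem=100, disk=0)  # cached in memory
tracker.on_block_update("rdd_1_0", mem=0, disk=80)   # same block, now on disk
tracker.on_block_update("rdd_1_1", mem=50, disk=0)
print(tracker.memory_used, tracker.disk_used)  # prints "50 80"
```

   Adding each report to the totals as if it were a delta would instead
   give 150 bytes of memory here, which shows why the previous values
   have to be remembered per block.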
   
   Some internal accounting was changed to save some memory, given the extra
   usage incurred by the above tracking.
   
   For reference, mem usage comparison (captured using jvisualvm) for 100k
   entries:
   - Scala HashMap[String, LiveRDDBlock]: 17MB
   - Scala HashMap[Int, LiveRDDBlock]: 11MB
   - OpenHashMap[Int, LiveRDDBlock]: 6MB
   - OpenHashMap[String, LiveRDDBlock]: 14MB
   
   So using an OpenHashMap (OHM) when you have primitive keys saves a lot of
   space (6MB vs. 11MB for Int keys above). When you have non-primitive keys,
   the savings don't add up to much (14MB vs. 17MB for String keys), so maps
   that need string keys were left untouched.
   
   The unit tests were also changed to reflect the actual behavior of the
   block manager when sending update events to the driver.
