vanzin opened a new pull request #25779: [SPARK-27468][core] Track correct storage level and mem/disk usage for RDDs. URL: https://github.com/apache/spark/pull/25779 Two things are being fixed here. The first, explicitly explained in the referenced bug, is the storage level tracked for RDDs and partitions. Previously, the RDD level would change depending on the status reported by executors for the block they were storing, and individual blocks would reflect that. That is wrong because different blocks may be stored differently in different executors. So now the RDD tracks the user-provided storage level, while the individual partitions reflect the current storage level of that particular block, including the current number of replicas. The second fix is in the accounting of usage: block managers report the current mem and disk used by the block, not the change from before. So the status listener needs to track the previous usage of the blocks so that it can accurately calculate the changes. This requires a bit more memory to be used in the driver, but tests show it's not that big of a problem (a few MB for a 100k-partition RDD with all blocks cached). Some internal accounting was changed to save some memory, given the extra usage incurred by the above tracking. For reference, mem usage comparison (captured using jvisualvm) for 100k entries: - Scala HashMap[String, LiveRDDBlock]: 17MB - Scala HashMap[Int, LiveRDDBlock]: 11MB - OpenHashMap[Int, LiveRDDBlock]: 6MB - OpenHashMap[String, LiveRDDBlock]: 14MB So using an OHM when you have primitive keys saves a lot of space. When you have non-primitive keys, the savings don't add up to much, so maps that need string keys were left untouched. The unit tests were also changed to reflect the actual behavior of the block manager when sending update events to the driver.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org