Marcelo Masiero Vanzin created SPARK-29319:
----------------------------------------------

             Summary: Memory and disk usage not accurate when blocks are 
evicted and re-loaded in memory
                 Key: SPARK-29319
                 URL: https://issues.apache.org/jira/browse/SPARK-29319
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4, 3.0.0
            Reporter: Marcelo Masiero Vanzin


I found this while running more targeted tests of the underlying behavior of 
this code, prompted by SPARK-27468. I ran this code:

{code}
import java.util.Arrays
import org.apache.spark.rdd._
import org.apache.spark.storage._

def newCachedRDD(level: StorageLevel): RDD[Array[Long]] = {
  // 64 partitions of 8 MB each (1M longs), enough that the two RDDs
  // together overflow the 1 GB executor's storage memory.
  val rdd = sc.parallelize(1 to 64, 64).map { i =>
    val a = new Array[Long](1024 * 1024)
    Arrays.fill(a, i)
    a
  }
  rdd.persist(level)
  rdd.count()  // materialize the cache
  rdd
}

val r1 = newCachedRDD(level = StorageLevel.MEMORY_AND_DISK)
val r2 = newCachedRDD(level = StorageLevel.MEMORY_ONLY)
{code}

The shell was started with {{./bin/spark-shell --master 'local-cluster[1,1,1024]'}}.

After it runs, you end up with the expected state: r1 is fully cached, but 
only on disk, because all of its memory blocks were evicted by r2; r2 has as 
many blocks cached as the memory can hold.
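
That state can be double-checked from the driver with 
{{sc.getRDDStorageInfo}} (a developer API); a minimal sketch, run in the same 
shell session:

{code}
// Driver-side view of each cached RDD; memSize and diskSize are the
// same aggregates the storage page displays.
sc.getRDDStorageInfo.foreach { info =>
  println(s"rdd=${info.id} mem=${info.memSize} disk=${info.diskSize} " +
    s"cached=${info.numCachedPartitions}/${info.numPartitions}")
}
{code}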

The problem shows up when you start playing with those RDDs again.

Calling {{r1.count()}} will cause all of r2's blocks to be evicted, since r1's 
blocks are loaded back into memory. But no block update is sent to the driver 
about that load, so the driver does not know that r1's blocks are now in memory. 
The UI will show r1 using 0 bytes of memory, and r2 will disappear from the 
storage page (the latter as expected).
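
The stale view is visible from the same session; a small sketch of that check, 
again using {{sc.getRDDStorageInfo}}:

{code}
r1.count()  // re-loads r1's blocks into memory, evicting r2's
// No block update was sent for the re-load, so the driver still
// reports r1 as using no memory, matching what the UI shows.
sc.getRDDStorageInfo.find(_.id == r1.id).foreach { info =>
  println(s"r1 reported memory usage: ${info.memSize} bytes")  // prints 0
}
{code}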

Calling {{r2.count()}} after that will cause r1's blocks to be evicted again. 
This time block updates are sent to the driver, which double-counts the disk 
usage, since the earlier re-load was never reported. So if you keep bouncing 
back and forth like this, r1's reported disk usage keeps growing in the UI, 
when in fact it doesn't change at all.
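
The drift is easy to watch by repeating the two counts and printing the 
driver-side number after each round; a sketch (same session and developer API 
as above), where each iteration should bump r1's reported diskSize even though 
the data on disk is unchanged:

{code}
(1 to 3).foreach { i =>
  r1.count()  // re-load into memory (not reported to the driver)
  r2.count()  // evict r1 again (reported, adding disk size once more)
  sc.getRDDStorageInfo.find(_.id == r1.id).foreach { info =>
    println(s"round $i: r1 reported disk usage = ${info.diskSize} bytes")
  }
}
{code}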


