Hi All,

I am resending this message because I suspect the original may have been blocked from the mailing list due to its attachments. Note that the mail does appear in the Apache archives <http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3CCANR-kKeO3mxL1QuX0fnz0DEPkU4FFbXO2W_5CdmtrzYKUfhaBg%40mail.gmail.com%3E> but not on Nabble, the online archive linked from the Spark website <http://apache-spark-user-list.1001560.n3.nabble.com/>.
The text of the original message appears below; the PDF <http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/raw/%3ccanr-kkeo3mxl1qux0fnz0depku4ffbxo2w_5cdmtrzykufh...@mail.gmail.com%3e/2> and PNG <http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/raw/%3ccanr-kkeo3mxl1qux0fnz0depku4ffbxo2w_5cdmtrzykufh...@mail.gmail.com%3e/3> files originally attached are now available as linked from the Apache archive.

best,
-Brad

---------- Forwarded message ----------
From: Brad Miller <[email protected]>
Date: Mon, Jun 30, 2014 at 10:20 AM
Subject: odd caching behavior or accounting
To: [email protected]

Hi All,

I've recently noticed some caching behavior which I did not understand and which may or may not indicate a bug. In short, the web UI seemed to indicate that some blocks were being added to the cache despite already being in the cache.

As documentation, I have attached two UI screenshots. The PNG captures enough of the screen to demonstrate the problem; the PDF is a printout of the full page. Notice that:

- Block rdd_21_1001 is in the cache twice, both times on letang.research.intel-research.net; many other blocks also occur twice on a variety of hosts. I have not confirmed that the duplicate block *always* appears on the same host, but it seems that way.
- The stated storage level is "Memory Deserialized 1x Replicated".
- The top left states that the "cached partitions" and "total partitions" are both 4000, but the table that enumerates the partitions contains 4534 entries.

Although not reflected in this screenshot, I believe I have seen this behavior occur even when the double caching of blocks causes eviction of blocks from other RDDs.

I am running the Spark 1.0.0 release and using pyspark.

best,
-Brad
