RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"

Shao, Saisai Wed, 23 Jul 2014 00:14:22 -0700

Hi Haopu, 

Please see the inline comments.

Thanks
Jerry

-----Original Message-----
From: Haopu Wang [mailto:hw...@qilinsoft.com] 
Sent: Wednesday, July 23, 2014 3:00 PM
To: user@spark.apache.org
Subject: "spark.streaming.unpersist" and "spark.cleaner.ttl"

I have a DStream receiving data from a socket. I'm using local mode.
I set "spark.streaming.unpersist" to "false" and leave
"spark.cleaner.ttl" to be infinite.
I can see files for input and shuffle blocks under the "spark.local.dir"
folder, and the size of the folder keeps increasing, although the JVM's memory
usage seems to be stable.

[question] In this case, the input RDDs are persisted but don't fit into 
memory, so they are written to disk, right? And where can I see the details of 
these RDDs? I don't see them in the web UI.

[answer] Yes, if there is not enough memory to hold the input RDDs, the data 
will be flushed to disk, because the default storage level is 
"MEMORY_AND_DISK_SER_2", as you can see in StreamingContext.scala. Actually, 
you cannot see the input RDDs in the web UI; you can only see cached RDDs there.
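
As a minimal sketch of the point above (not part of the original thread), the 
storage level can be passed explicitly when the socket stream is created. The 
host, port, batch interval, and object name are assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical example object; names and endpoint are illustrative only.
object SocketStorageLevelSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("SocketStorageLevelSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Same level as the socketTextStream default, spelled out here:
    // serialized, kept in memory, spilled to disk when memory runs out,
    // replicated twice.
    val lines = ssc.socketTextStream(
      "localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Passing a different level here (for example StorageLevel.MEMORY_ONLY_SER) 
would change whether received blocks can spill to "spark.local.dir".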

Then I set "spark.streaming.unpersist" to "true", and the size of the 
"spark.local.dir" folder and the JVM's used heap size are reduced regularly.

[question] In this case, because I didn't change "spark.cleaner.ttl", which 
component is doing the cleanup? And what's the difference if I set 
"spark.cleaner.ttl" to some duration in this case?

[answer] If you set "spark.streaming.unpersist" to true, old unused RDDs will 
be deleted, as you can see in DStream.scala. "spark.cleaner.ttl", in contrast, 
is a timer-based Spark cleaner that cleans not only streaming data but also 
broadcast, shuffle, and other data.
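
For illustration only, the two settings discussed here could be applied 
through SparkConf roughly as follows. The TTL value of 3600 (seconds) and the 
object name are assumptions, not recommendations from this thread.

```scala
import org.apache.spark.SparkConf

// Hypothetical example object showing where the two properties are set.
object CleanupSettingsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("CleanupSettingsSketch")
      // Let Spark Streaming unpersist RDDs it no longer needs.
      .set("spark.streaming.unpersist", "true")
      // Timer-based cleaner; value is in seconds (assumed value).
      .set("spark.cleaner.ttl", "3600")

    // Print the effective settings to confirm what was applied.
    println(conf.toDebugString)
  }
}
```

Leaving out the "spark.cleaner.ttl" line keeps the cleaner at its infinite 
default, which matches the setup described in the question.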

Thank you!
