Yeah, I wrote those lines a while back; I wanted to contrast storage levels with and without serialization. I should have realized that StorageLevel.MEMORY_ONLY_SER could be mistaken for the default level.
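
For anyone reading along, here is a minimal sketch of making the level explicit on both the input stream and a derived DStream (the host, port, app name, and batch interval are placeholders, not values from this thread):

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object StorageLevelSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setMaster("local[2]").setAppName("storage-level-sketch")
      val ssc = new StreamingContext(conf, Seconds(5))

      // Receiver input DStream: the storage level parameter defaults to
      // MEMORY_AND_DISK_SER_2 (see StreamingContext.scala); passing it explicitly
      // avoids confusing it with the MEMORY_ONLY_SER mentioned in the guide.
      val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

      // Derived DStreams: persist() lets you pick the level yourself, e.g. the
      // serialized in-memory level the memory-tuning section describes.
      val words = lines.flatMap(_.split(" "))
      words.persist(StorageLevel.MEMORY_ONLY_SER)

      words.count().print()
      ssc.start()
      ssc.awaitTermination()
    }
  }
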
TD

On Wed, Jul 23, 2014 at 5:12 AM, Shao, Saisai <saisai.s...@intel.com> wrote:
> Yeah, the document may not be precisely aligned with the latest code, so the
> best way is to check the code.
>
> -----Original Message-----
> From: Haopu Wang [mailto:hw...@qilinsoft.com]
> Sent: Wednesday, July 23, 2014 5:56 PM
> To: user@spark.apache.org
> Subject: RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"
>
> Jerry, thanks for the response.
>
> For the default storage level of DStreams, it looks like Spark's documentation
> is wrong. In this link:
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#memory-tuning
> it mentions:
> "Default persistence level of DStreams: Unlike RDDs, the default persistence
> level of DStreams serializes the data in memory (that is,
> StorageLevel.MEMORY_ONLY_SER for DStream compared to StorageLevel.MEMORY_ONLY
> for RDDs). Even though keeping the data serialized incurs higher
> serialization/deserialization overheads, it significantly reduces GC pauses."
>
> I will take a look at DStream.scala, although I have no Scala experience.
>
> -----Original Message-----
> From: Shao, Saisai [mailto:saisai.s...@intel.com]
> Sent: July 23, 2014 15:13
> To: user@spark.apache.org
> Subject: RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"
>
> Hi Haopu,
>
> Please see the inline comments.
>
> Thanks
> Jerry
>
> -----Original Message-----
> From: Haopu Wang [mailto:hw...@qilinsoft.com]
> Sent: Wednesday, July 23, 2014 3:00 PM
> To: user@spark.apache.org
> Subject: "spark.streaming.unpersist" and "spark.cleaner.ttl"
>
> I have a DStream receiving data from a socket, and I'm using local mode.
> I set "spark.streaming.unpersist" to "false" and leave "spark.cleaner.ttl"
> infinite.
> I can see files for input and shuffle blocks under the "spark.local.dir"
> folder, and the size of the folder keeps increasing, although the JVM's
> memory usage seems to be stable.
>
> [question] In this case, because the input RDDs are persisted but don't fit
> into memory, they are written to disk, right? And where can I see the details
> about these RDDs? I don't see them in the web UI.
>
> [answer] Yes, if there is not enough memory to hold the input RDDs, the data
> will be flushed to disk, because the default storage level is
> "MEMORY_AND_DISK_SER_2", as you can see in StreamingContext.scala. Note that
> you cannot see the input RDDs in the web UI; you can only see cached RDDs
> there.
>
> Then I set "spark.streaming.unpersist" to "true", and the size of the
> "spark.local.dir" folder and the JVM's used heap size are reduced regularly.
>
> [question] In this case, because I didn't change "spark.cleaner.ttl", which
> component is doing the cleanup? And what's the difference if I set
> "spark.cleaner.ttl" to some duration in this case?
>
> [answer] If you set "spark.streaming.unpersist" to true, old unused RDDs will
> be deleted, as you can see in DStream.scala. "spark.cleaner.ttl", on the other
> hand, is a timer-based Spark cleaner that cleans not only streaming data but
> also broadcast, shuffle, and other data.
>
> Thank you!
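
For reference, a minimal sketch of how the two settings discussed above would be set on the SparkConf (the master, app name, port, batch interval, and the 3600-second ttl are placeholder values, not recommendations from the thread):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object CleanupSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setMaster("local[2]")
        .setAppName("cleanup-sketch")
        // Let Spark Streaming unpersist old input and generated RDDs on its own
        // (see DStream.scala); this is what shrinks spark.local.dir over time.
        .set("spark.streaming.unpersist", "true")
        // Timer-based metadata cleaner covering RDD, shuffle, and broadcast data;
        // the 3600-second value is only an example duration.
        .set("spark.cleaner.ttl", "3600")

      val ssc = new StreamingContext(conf, Seconds(5))
      val lines = ssc.socketTextStream("localhost", 9999)
      lines.count().print()
      ssc.start()
      ssc.awaitTermination()
    }
  }
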