It seems the default value for spark.cleaner.delay is 3600 seconds, but I
need to be able to count things on a daily, weekly, or even monthly basis.
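
For reference, this is roughly how I raise it at the moment; a minimal
sketch, assuming the value is read from a JVM system property set before
the StreamingContext is created (the exact property name may vary between
Spark versions):

    // Sketch: raise the cleaner delay to ~1 month before creating the context.
    // Assumes the setting is read as a system property, with the value in seconds.
    System.setProperty("spark.cleaner.delay", (30 * 24 * 3600).toString)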

I suppose the aim of DStream batches and spark.cleaner.delay is to avoid
space issues (running out of memory, etc.). I usually use HyperLogLog for
counting unique things to save space, and AFAIK the other metrics are
simply long values, which don't require much space.
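
To illustrate, here is the kind of thing I mean; a small sketch using
stream-lib's HyperLogLog (one common JVM implementation; the 12 register
bits are an arbitrary precision choice):

    import com.clearspring.analytics.stream.cardinality.HyperLogLog

    // One HLL per counting period: log2m = 12 keeps the estimate within
    // roughly 1.6% error in a few KB, no matter how many events arrive.
    val uniques = new HyperLogLog(12)
    Seq("user-1", "user-2", "user-1").foreach(e => uniques.offer(e))
    println(uniques.cardinality())  // ~2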

When I started learning Spark Streaming, it really confused me, because in
my first "Hello World" example all I wanted was to count all events
processed by Spark Streaming. DStream batches are nice, but when I need
simple counting operations they become complex. Since Spark Streaming
produces a new RDD for each batch interval, I needed to fold them into a
single running count, so I used updateStateByKey() to generate a
StateDStream. It seems to work now, but I'm not sure whether it's
efficient, because all I need is a single global counter, yet Spark now
keeps counters for every 2-second interval plus a global counter in the
StateDStream.
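
Here is roughly what I ended up with (a minimal sketch; the socket source
and the 2-second batch interval are just placeholders for my real setup):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext("local[2]", "GlobalCounter", Seconds(2))
    ssc.checkpoint("checkpoint")  // updateStateByKey needs a checkpoint dir

    // Placeholder source; any DStream of events works the same way.
    val events = ssc.socketTextStream("localhost", 9999)

    // Collapse every event onto one shared key, so the state holds a
    // single global count that survives across batch intervals.
    val total = events
      .map(_ => ("total", 1L))
      .updateStateByKey[Long]((batch: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + batch.sum))

    total.print()
    ssc.start()
    ssc.awaitTermination()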

I don't have a specific requirement like "show me this type of unique
things for the last 10 minutes"; instead, I need to be able to count
things at large scale, whether over 10 minutes or over 1 month. I create
pre-aggregation rules on the fly, and when all I need is a simple monthly
counter, Spark seems like overkill for now.

Do you have any advice on how to use Spark Streaming efficiently for this?



