bs"d
I am new to Spark Streaming and have some questions that I can't find anything
in the documentation to answer.

I believe a lot of Spark users in general, and Spark Streaming users in
particular, use it to analyze events by computing large distributed
aggregations.
In my case I have to "digest" a lot of events very fast, and I compute both
high-resolution aggregates (e.g. every 30 seconds) and hourly aggregates.

1. 
What happens when I take the DStream of RDDs generated by the 30-second
aggregates and call the countByValueAndWindow method with windowDuration =
60 minutes and slideDuration = 60 minutes?

Will each RDD added to the DStream be folded into the aggregation as soon as
it is generated, or will all of the aggregation work be performed only after
60 minutes?

If it is only performed after an hour, I guess it would be better to do the
periodic aggregates myself using foreachRDD? A rough sketch of the two
variants I mean follows below.
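This is just a minimal sketch, not working code; the socket source, host/port,
and checkpoint path are placeholders I made up:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowedAggregates {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedAggregates")
    val ssc  = new StreamingContext(conf, Seconds(30))   // 30-second batch interval
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")    // window ops need a checkpoint dir (path is made up)

    // Placeholder source: one event value per line over a socket.
    val events = ssc.socketTextStream("localhost", 9999)

    // Variant A: let Spark build the hourly aggregate from the 30-second batches.
    // slideDuration == windowDuration, so the windowed result is produced once per hour.
    val hourlyCounts = events.countByValueAndWindow(Minutes(60), Minutes(60))
    hourlyCounts.print()

    // Variant B: count each 30-second batch myself and merge the partial counts
    // into an external store from foreachRDD (the merge itself is not shown).
    events.countByValue().foreachRDD { rdd =>
      rdd.foreach { case (value, count) =>
        println(s"partial count: $value -> $count")  // replace with a store/merge call
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}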

2. 
I believe I understood from the documentation that DStreams are persisted by
default and that we should use checkpointing to "free" some memory.
If I only want to store the hourly aggregates, is it possible to free the
DStreams without calling checkpoint, which writes the data to disk and may
become a bottleneck?
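To make the question concrete, this is roughly what I have in mind; I am
assuming the spark.streaming.unpersist setting and a serialized storage level
are the relevant knobs, and the source and sink here are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object PersistenceControl {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("PersistenceControl")
      // Assumption: let Spark automatically unpersist generated RDDs that fall
      // out of use, instead of relying on checkpointing to free memory.
      .set("spark.streaming.unpersist", "true")

    val ssc    = new StreamingContext(conf, Seconds(30))
    val events = ssc.socketTextStream("localhost", 9999)   // placeholder source

    // Hourly aggregate built with window() (no inverse function, so no checkpoint
    // directory is required), persisted in serialized form to reduce memory use.
    val hourly = events.window(Minutes(60), Minutes(60)).countByValue()
    hourly.persist(StorageLevel.MEMORY_ONLY_SER)

    // Store only the hourly results; the intermediate per-batch RDDs can then be dropped.
    hourly.foreachRDD { rdd =>
      rdd.foreach { case (value, count) =>
        println(s"hourly count: $value -> $count")   // replace with the real sink
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}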

3. 
Is it possible to have the "checkpoint" data written not as HDFS files but in
some other form, e.g. into a Cassandra database?
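For the results themselves I assume something like the DataStax
spark-cassandra-connector could be used from foreachRDD, as sketched below
(the keyspace, table, and column names are made up), but I don't know whether
the checkpoint data itself can avoid a Hadoop-compatible filesystem:

import com.datastax.spark.connector._        // DataStax spark-cassandra-connector (external dependency)
import org.apache.spark.streaming.dstream.DStream

object CassandraSink {
  // Keyspace "metrics", table "hourly_counts(value text, cnt bigint)" and the
  // column names are made-up examples; spark.cassandra.connection.host has to
  // be set on the SparkConf for the connector to find the cluster.
  def storeHourlyCounts(hourlyCounts: DStream[(String, Long)]): Unit = {
    hourlyCounts.foreachRDD { rdd =>
      rdd.saveToCassandra("metrics", "hourly_counts", SomeColumns("value", "cnt"))
    }
  }
}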

I find almost no documentation covering the Spark Streaming project and hope
someone who understands the material well will be able to shed some light on
the subject.

Best
DD


