Re: Difference among batchDuration, windowDuration, slideDuration
I think hsy541 is still confused by the same thing that is still confusing me. Namely, which interval is the sentence "Each RDD in a DStream contains data from a certain interval" speaking of? This is from the Discretized Streams section of the programming guide: http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams

The example makes it seem like the batchDuration is 4 seconds and this mystery interval is 1 second. Where is this mystery interval defined? Or am I missing something altogether?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Difference-among-batchDuration-windowDuration-slideDuration-tp9966p22119.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
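As far as the programming guide describes it, the interval in that sentence is the batch interval itself, i.e. the batchDuration passed to StreamingContext; there does not appear to be a separate setting. The sketch below is a toy model in plain Scala with no Spark dependency. The names BatchModel, Event and toBatches are made up for illustration and are not part of the Spark API. It only shows the idea that every record lands in exactly one batch, and that each batch would correspond to one RDD of the input DStream.

```scala
// Toy model only: no Spark involved. All names here are hypothetical.
object BatchModel {
  // A record tagged with its arrival time in milliseconds (assumed non-negative).
  final case class Event(timeMs: Long, value: String)

  // Assign each event to the batch whose interval
  // [k * batchMs, (k + 1) * batchMs) contains its arrival time.
  // In Spark Streaming, each such batch would become one RDD of the input DStream.
  def toBatches(events: Seq[Event], batchMs: Long): Map[Long, Seq[String]] = {
    require(batchMs > 0, "batch interval must be positive")
    events
      .groupBy(e => e.timeMs / batchMs)                 // batch index k
      .map { case (k, es) => k -> es.map(_.value) }     // keep only the payloads
  }
}
```

With a batch interval of 1000 ms, events arriving at 0 ms and 999 ms fall into batch 0, and an event at 1000 ms falls into batch 1. An interval in which nothing arrives simply has no entry in this model.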
Re: Difference among batchDuration, windowDuration, slideDuration
Thanks Tathagata. So can I say that the RDD size (from the stream) is the window size, and that the overlap between two adjacent RDDs is the sliding size? But I still don't understand what the batch size is. Why do we need it, since data processing is RDD by RDD, right? And does Spark chop the data into RDDs at the very beginning? Do you allow event-by-event processing, for example filtering?

On Wed, Jul 16, 2014 at 6:47 PM, Tathagata Das tathagata.das1...@gmail.com wrote:

I guess this is better explained in the window operation subsection of the streaming programming guide: http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations

For completeness' sake, it's worth mentioning the following. Window operations can be applied on other windowed DStreams as well. So the correct thing to say is that the window duration and slide duration of a window operation must be multiples of the sliding interval of the parent DStream. For a simple, non-windowed DStream, this sliding interval is the same as the batch interval.

    // say the batch interval is 2 seconds
    inputstream                                   // moves every batch interval, i.e. every 2 seconds
    inputstream.window(Seconds(3))                // not allowed, must be a multiple of 2 seconds
    inputstream.window(Seconds(4))                // allowed, moves every 2 seconds (so the sliding interval is 2 seconds)
    inputstream.window(Seconds(10), Seconds(4))   // allowed, moves every 4 seconds (so the sliding interval is 4 seconds)
    inputstream.window(Seconds(10), Seconds(4)).window(Seconds(6))   // not allowed, as the window interval must be a multiple of the parent's sliding interval, which is 4 seconds
    inputstream.window(Seconds(10), Seconds(4)).window(Seconds(8))   // allowed

Hopefully that made sense :)

TD

On Wed, Jul 16, 2014 at 12:41 PM, Walrus theCat walrusthe...@gmail.com wrote:

I did not!

On Wed, Jul 16, 2014 at 12:31 PM, aaronjosephs aa...@placeiq.com wrote:

The only other thing to keep in mind is that window duration and slide duration have to be multiples of batch duration. IDK if you made that fully clear.
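The multiple-of rule from TD's examples can be written down as a small check. This is a sketch in plain Scala, not Spark code: WindowRules, Stream and window are hypothetical names, and a stream is reduced to nothing but its sliding interval in seconds. It assumes, as in TD's examples, that a window call without an explicit slide duration inherits the parent's sliding interval.

```scala
// Toy model only: no Spark involved. All names here are hypothetical.
object WindowRules {
  // A stream is described only by how often it produces an RDD, in seconds.
  final case class Stream(slideSec: Int)

  // Stand-in for a window operation on `parent`.
  // Returns the resulting stream, or an error message if the window or slide
  // duration is not a multiple of the parent's sliding interval.
  def window(parent: Stream, windowSec: Int, slideSec: Option[Int] = None): Either[String, Stream] = {
    // Assumption: with no explicit slide, the window moves with the parent.
    val slide = slideSec.getOrElse(parent.slideSec)
    if (windowSec % parent.slideSec != 0)
      Left(s"window duration $windowSec s is not a multiple of ${parent.slideSec} s")
    else if (slide % parent.slideSec != 0)
      Left(s"slide duration $slide s is not a multiple of ${parent.slideSec} s")
    else
      Right(Stream(slide)) // the new stream moves every `slide` seconds
  }
}
```

Starting from Stream(2) for a 2-second batch interval, window(input, 3) is rejected and window(input, 10, Some(4)) yields a stream that moves every 4 seconds, so a further window of 6 seconds on top of it is rejected while 8 seconds is accepted.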
Difference among batchDuration, windowDuration, slideDuration
When reading the API of Spark Streaming, I'm confused by the 3 different durations:

    StreamingContext(conf: SparkConf, batchDuration: Duration)
    DStream.window(windowDuration: Duration, slideDuration: Duration): DStream[T]

API links:
SparkConf: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html
Duration: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/Duration.html
DStream: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/dstream/DStream.html

Can anyone please explain these 3 different durations?

Best,
Siyuan
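One way to see how the three durations relate is a toy model in plain Scala, with no Spark dependency. DurationsDemo and windows are made-up names for illustration only. The model assumes the input has already been cut into batches (one per batch interval), that a window is the concatenation of the most recent windowSec / batchSec batches, and that a new window is produced every slideSec / batchSec batches. It is a simplification, not a description of Spark internals.

```scala
// Toy model only: no Spark involved. All names here are hypothetical.
object DurationsDemo {
  // batches(i) holds the records received during the i-th batch interval.
  // Returns one window per slide; each window concatenates the batches it covers.
  // Windows near the start are shorter because fewer batches exist yet.
  def windows[A](batches: Vector[Seq[A]], batchSec: Int, windowSec: Int, slideSec: Int): Vector[Seq[A]] = {
    require(batchSec > 0 && windowSec > 0 && slideSec > 0, "durations must be positive")
    require(windowSec % batchSec == 0 && slideSec % batchSec == 0,
      "window and slide durations must be multiples of the batch duration")
    val perWindow = windowSec / batchSec // number of batches covered by one window
    val perSlide  = slideSec / batchSec  // number of batches between two windows
    // `end` is the exclusive index of the last batch included in a window.
    (perSlide to batches.length by perSlide).toVector.map { end =>
      batches.slice(math.max(0, end - perWindow), end).flatten
    }
  }
}
```

For four batches holding 1, 2, 3 and 4, a batch duration of 2 seconds, a window of 4 seconds and a slide of 2 seconds give the overlapping windows (1), (1, 2), (2, 3), (3, 4). Raising the slide to 4 seconds gives the non-overlapping windows (1, 2) and (3, 4).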
Re: Difference among batchDuration, windowDuration, slideDuration
The only other thing to keep in mind is that window duration and slide duration have to be multiples of batch duration. IDK if you made that fully clear.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Difference-among-batchDuration-windowDuration-slideDuration-tp9966p9973.html