Probably easier if you show some more code, but if you just call dstream.window(Seconds(60)), you didn't specify a slide duration, so it's going to default to your batch duration of 1 second. So yes, if you're just using e.g. foreachRDD to output every message in the window, then every second it's going to output the last 60 seconds of messages... which does mean each message will be output a total of 60 times.
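If you want each message to land in exactly one window, pass an explicit slide duration equal to the window duration so the windows tumble instead of overlap. A minimal sketch, assuming a DStream named dstream as in your description (the Kafka write is left as a placeholder):

    import org.apache.spark.streaming.Seconds

    // Tumbling window: window duration and slide duration are both 60s,
    // so each 1s batch belongs to exactly one window and each message
    // is emitted once.
    val windowed = dstream.window(Seconds(60), Seconds(60))

    windowed.foreachRDD { rdd =>
      // placeholder: write rdd back to Kafka here
    }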
Using a smaller window of 5 seconds as an example, with 1 message per second, notice that message 1489765610 is output a total of 5 times:

Window: 1489765605 1489765606 1489765607 1489765608 1489765609
Window: 1489765606 1489765607 1489765608 1489765609 1489765610
Window: 1489765607 1489765608 1489765609 1489765610 1489765611
Window: 1489765608 1489765609 1489765610 1489765611 1489765612
Window: 1489765609 1489765610 1489765611 1489765612 1489765613
Window: 1489765610 1489765611 1489765612 1489765613 1489765614
Window: 1489765611 1489765612 1489765613 1489765614 1489765615

On Thu, Mar 16, 2017 at 2:34 PM, Dominik Safaric <dominiksafa...@gmail.com> wrote:
> Hi all,
>
> I've implemented a streaming application pulling data from Kafka every 1
> second (batch interval), and I am observing some quite strange behaviour (I
> haven't used Spark extensively in the past, but rather continuous
> operator based engines).
>
> Namely, the dstream.window(Seconds(60)) windowed stream, when written back
> to Kafka, contains more messages than were consumed (for debugging purposes
> I'm using a small dataset of a million Kafka byte array deserialized
> messages). In particular, in total I've streamed exactly 1 million messages,
> whereas upon window expiry 60 million messages are written back to Kafka.
>
> I've read in the official docs that both the window and window slide
> duration must be multiples of the batch interval. Does this mean that, with
> a 1 second batch interval and a 60 second window, the RDD of a given batch
> interval t is ingested into the windowed stream 59 more times?
>
> If I would like to achieve this behaviour (batch interval equal to one
> second, window duration of 60 seconds), how might one achieve it?
>
> I would appreciate it if anyone could correct me if I got the internals of
> Spark's windowed operations wrong and elaborate a bit.
>
> Thanks,
> Dominik