Correct - that part I understood. However, what alternative transformation might I apply to iterate through the RDDs, given a window duration of 60 seconds that I cannot change?
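One alternative, following from the explanation quoted below about the default slide duration, is to pass an explicit slide duration equal to the window duration, i.e. a non-overlapping (tumbling) window: the window stays 60 seconds long, but each record then falls into exactly one window instead of 60. A minimal sketch, assuming a tumbling window is acceptable; the socket source and println output are placeholders for the Kafka input and output of the original application, which are not shown in the thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("TumblingWindowSketch").setMaster("local[2]")
    // 1-second batch interval, as in the original setup
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder source; stands in for the Kafka direct stream used in the original application
    val stream = ssc.socketTextStream("localhost", 9999)

    // Window duration 60s, slide duration 60s: windows do not overlap,
    // so each record is emitted once per window rather than 60 times
    val windowed = stream.window(Seconds(60), Seconds(60))

    windowed.foreachRDD { rdd =>
      // Placeholder output; the original application writes the records back to Kafka here
      rdd.foreach(record => println(record))
    }

    ssc.start()
    ssc.awaitTermination()
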
> On 17 Mar 2017, at 16:57, Cody Koeninger <c...@koeninger.org> wrote:
>
> Probably easier if you show some more code, but if you just call
> dstream.window(Seconds(60)), you didn't specify a slide duration, so it's
> going to default to your batch duration of 1 second.
> So yeah, if you're just using e.g. foreachRDD to output every message in
> the window, every second it's going to output the last 60 seconds of
> messages... which does mean each message will be output a total of 60 times.
>
> Using a smaller window of 5 seconds as an example, with 1 message per
> second, notice that message 1489765610 is output a total of 5 times:
>
> Window:
> 1489765605
> 1489765606
> 1489765607
> 1489765608
> 1489765609
> Window:
> 1489765606
> 1489765607
> 1489765608
> 1489765609
> 1489765610
> Window:
> 1489765607
> 1489765608
> 1489765609
> 1489765610
> 1489765611
> Window:
> 1489765608
> 1489765609
> 1489765610
> 1489765611
> 1489765612
> Window:
> 1489765609
> 1489765610
> 1489765611
> 1489765612
> 1489765613
> Window:
> 1489765610
> 1489765611
> 1489765612
> 1489765613
> 1489765614
> Window:
> 1489765611
> 1489765612
> 1489765613
> 1489765614
> 1489765615
>
> On Thu, Mar 16, 2017 at 2:34 PM, Dominik Safaric <dominiksafa...@gmail.com> wrote:
>> Hi all,
>>
>> I have implemented a streaming application pulling data from Kafka every
>> second (batch interval), and I am observing some quite strange behaviour
>> (I haven't used Spark extensively in the past, but rather continuous
>> operator based engines).
>>
>> Namely, the dstream.window(Seconds(60)) windowed stream, when written back
>> to Kafka, contains more messages than were consumed (for debugging purposes
>> I am using a small dataset of a million byte-array-deserialized Kafka
>> messages). In particular, I streamed exactly 1 million messages in total,
>> whereas upon window expiry 60 million messages are written back to Kafka.
>>
>> I've read in the official docs that both the window duration and the slide
>> duration must be multiples of the batch interval. Does this mean that, as
>> the window slides every batch interval, the RDDs of a given batch interval t
>> are ingested into the windowed stream another 59 times?
>>
>> If I would like to achieve this behaviour (a batch interval of one second,
>> window duration of 60 seconds) - how might one do so?
>>
>> I would appreciate it if anyone could correct me if I got the internals of
>> Spark's windowed operations wrong and elaborate a bit.
>>
>> Thanks,
>> Dominik

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org