Correct - that is the part I understood well.

However, what alternative transformation might I apply to iterate through the
RDDs, given a window duration of 60 seconds that I cannot change?
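For example, would something along the lines of the rough sketch below be a
sensible approach, i.e. making the windows tumble rather than slide by passing
the slide duration explicitly? (Assuming my input DStream[String] is called
stream, with the 1-second batch interval unchanged.)

    // Rough sketch only: with slideDuration == windowDuration the windows no
    // longer overlap, so each record belongs to exactly one 60-second window.
    // (Seconds comes from org.apache.spark.streaming.)
    val windowed = stream.window(Seconds(60), Seconds(60))

    windowed.foreachRDD { rdd =>
      // each record should show up here exactly once
      rdd.foreach(record => println(record))
    }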

> On 17 Mar 2017, at 16:57, Cody Koeninger <c...@koeninger.org> wrote:
> 
> Probably easier if you show some more code, but if you just call
> dstream.window(Seconds(60))
> you didn't specify a slide duration, so it's going to default to your
> batch duration of 1 second.
> So yeah, if you're just using e.g. foreachRDD to output every message
> in the window, every second it's going to output the last 60 seconds
> of messages... which does mean each message will be output a total of
> 60 times.
> 
> Using a smaller window of 5 seconds as an example, with 1 message per
> second, notice that message 1489765610 will be output a total of 5
> times:
> 
> Window:
> 1489765605
> 1489765606
> 1489765607
> 1489765608
> 1489765609
> Window:
> 1489765606
> 1489765607
> 1489765608
> 1489765609
> 1489765610
> Window:
> 1489765607
> 1489765608
> 1489765609
> 1489765610
> 1489765611
> Window:
> 1489765608
> 1489765609
> 1489765610
> 1489765611
> 1489765612
> Window:
> 1489765609
> 1489765610
> 1489765611
> 1489765612
> 1489765613
> Window:
> 1489765610
> 1489765611
> 1489765612
> 1489765613
> 1489765614
> Window:
> 1489765611
> 1489765612
> 1489765613
> 1489765614
> 1489765615
> 
> On Thu, Mar 16, 2017 at 2:34 PM, Dominik Safaric
> <dominiksafa...@gmail.com> wrote:
>> Hi all,
>> 
>> As I’ve implemented a streaming application pulling data from Kafka every 1
>> second (the batch interval), I am observing some quite strange behaviour (I
>> haven’t used Spark extensively in the past, but rather continuous-operator
>> based engines).
>> 
>> Namely, the dstream.window(Seconds(60)) windowed stream, when written back to
>> Kafka, contains more messages than were consumed (for debugging purposes I am
>> using a small dataset of a million Kafka messages deserialized as byte arrays).
>> In particular, I streamed exactly 1 million messages in total, whereas upon
>> window expiry 60 million messages are written back to Kafka.
>> 
>> I’ve read in the official docs that both the window duration and the slide
>> duration must be multiples of the batch interval. Does this mean that, because
>> the windows overlap, the RDD of a given batch interval t is ingested into the
>> windowed stream another 59 times?
>> 
>> If I would like to keep this setup (a batch interval of one second and a
>> window duration of 60 seconds) but have each message written out only once,
>> how might one achieve this?
>> 
>> I would appreciate it if anyone could correct me if I’ve got the internals of
>> Spark’s windowed operations wrong, and elaborate a bit.
>> 
>> Thanks,
>> Dominik

