Have you considered trying event-time aggregation in Structured Streaming instead?
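
For reference, a minimal sketch of what that could look like (the broker address, topic name, and console sink below are illustrative assumptions, not taken from your application; the Kafka source needs the spark-sql-kafka-0-10 package on the classpath):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, window}

    val spark = SparkSession.builder.appName("EventTimeWindows").getOrCreate()

    // Read from Kafka; the Kafka source exposes a `timestamp` column
    // alongside key, value, topic, partition, and offset.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // Group records into non-overlapping 60-second event-time windows
    // and count per window; each record lands in exactly one window.
    val counts = events
      .groupBy(window(col("timestamp"), "60 seconds"))
      .count()

    // Console sink just for inspection; a Kafka sink would need the
    // result serialized back into a `value` column first.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()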
On Thu, Mar 16, 2017 at 12:34 PM, Dominik Safaric <dominiksafa...@gmail.com> wrote:

> Hi all,
>
> As I've implemented a streaming application pulling data from Kafka every
> second (a 1-second batch interval), I am observing some quite strange
> behaviour (I haven't used Spark extensively in the past, but rather
> continuous-operator-based engines instead).
>
> Namely, the dstream.window(Seconds(60)) windowed stream, when written back
> to Kafka, contains more messages than were consumed (for debugging purposes
> I am using a small dataset of a million byte-array-deserialized Kafka
> messages). In particular, I streamed exactly 1 million messages in total,
> whereas upon window expiry 60 million messages are written back to Kafka.
>
> I've read in the official docs that both the window duration and the window
> slide duration must be multiples of the batch interval. Does this mean that,
> when consuming messages between two windows, the RDDs of a given batch
> interval *t* are ingested 59 more times into the windowed stream?
>
> If I would like to achieve this behaviour (batch interval equal to one
> second, window duration of 60 seconds) - how might one achieve it?
>
> I would appreciate it if anyone could correct me if I got the internals of
> Spark's windowed operations wrong, and could elaborate a bit.
>
> Thanks,
> Dominik
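
On the DStream question itself: window(Seconds(60)) with a 1-second batch interval defaults to a slide duration of one batch, so each 1-second RDD is included in 60 consecutive overlapping windows - that is exactly where the 60 million figure comes from, and yes, each batch is effectively ingested 59 more times. If you want every message to appear in exactly one 60-second window, pass an explicit slide duration equal to the window duration to get tumbling windows. A minimal sketch (the local socket source below is just a stand-in for your Kafka direct stream):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("TumblingWindow").setMaster("local[2]")
    // 1-second batch interval, as in the original application.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)

    // window(windowDuration, slideDuration): with the slide omitted it
    // defaults to the batch interval, so each batch would appear in 60
    // overlapping windows. An explicit 60-second slide makes the windows
    // tumble, and each record is emitted exactly once.
    val tumbling = lines.window(Seconds(60), Seconds(60))

    tumbling.count().print()

    ssc.start()
    ssc.awaitTermination()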