Re: Understanding the topology of high level kafka stream

Sachin Mittal Wed, 09 Nov 2016 07:30:06 -0800

Hi,
What happens when the message itself is purged by kafka via retention time
setting or something else, which was later than the last offset stored by
the stream consumer.


I am asking this because I am planning to keep the retention time for
internal changelog topics also small so no message gets big enough to start
getting exceptions.

So if messages from last offset are deleted then will there be any issues?

Also is there anyway to control or set the offset manually when we re start
the streaming application so certain old messages are not consumed at all
as logic wise they are not useful to streaming application any more. Like
say past users sessions created while streaming application was stopped.


Thanks
Sachin


On Wed, Nov 9, 2016 at 7:46 PM, Eno Thereska <[email protected]> wrote:

> Hi Sachin,
>
> Kafka Streams is built on top of standard Kafka consumers. For for every
> topic it consumes from (whether changelog topic or source topic, it doesn't
> matter), the consumer stores the offset it last consumed from. Upon
> restart, by default it start consuming from where it left off from each of
> the topics. So you can think of it this way: a restart should be no
> different than if you had left the application running (i.e., no restart).
>
> Thanks
> Eno
>
>
> > On 9 Nov 2016, at 13:59, Sachin Mittal <[email protected]> wrote:
> >
> > Hi,
> > I had some basic questions on sequence of tasks for streaming application
> > restart in case of failure or otherwise.
> >
> > Say my stream is structured this way
> >
> > source-topic
> >   branched into 2 kstreams
> >    source-topic-1
> >    source-topic-2
> >   each mapped to 2 new kstreams (new key,value pairs) backed by 2 kafka
> > topics
> >       source-topic-1-new
> >       source-topic-2-new
> >       each aggregated to new ktable backed by internal changelog topics
> >       source-topic-1-new-table (scource-topic-1-new-changelog)
> >       source-topic-2-new-table (scource-topic-2-new-changelog)
> >       table1 left join table2 -> to final stream
> > Results of final stream are then persisted into another data storage
> >
> > So if you see I have following physical topics or state stores
> > source-topic
> > source-topic-1-new
> > source-topic-2-new
> > scource-topic-1-new-changelog
> > scource-topic-2-new-changelog
> >
> > Now at a give point if the streaming application is stopped there is some
> > data in all these topics.
> > Barring the source-topic all other topic has data inserted by the
> streaming
> > application.
> >
> > Also I suppose streaming application stores the offset for each of the
> > topic as where it was last.
> >
> > So when I restart the application how does the processing starts again?
> > Will it pick the data from last left changelog topics and process them
> > first and then process the source topic data from the offset last left?
> >
> > Or it will start from source topic. I really don't want it to maintain
> > offset to changelog tables because any old key's value can be modified as
> > part of aggregation again.
> >
> > Bit confused here, any light would help a lot.
> >
> > Thanks
> > Sachin
>
>

Re: Understanding the topology of high level kafka stream

Reply via email to