What is the end result your consumers produce?
From what I understand, the requirement of having no duplicates means that
duplicates can currently show up somewhere?

Depending on your needs, you could also have consumers in both DCs consuming
from both clusters. Then you don't have duplicates, because a message is
produced on either one cluster or the other.
I would really avoid MirrorMaker for this setup (it is the component that
creates the duplicates if you end up consuming from both clusters).
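
To make that concrete, here is a minimal sketch of the idea, assuming a
hypothetical topic "events", a hypothetical group id "storage-loader" and
hypothetical bootstrap addresses (not your actual setup): one consuming
application holds one KafkaConsumer per cluster and writes everything to the
same storage, with no MirrorMaker in between.

  import java.time.Duration;
  import java.util.Arrays;
  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  public class DualClusterConsumer {

      // Builds a consumer for one cluster (bootstrap address is hypothetical).
      private static KafkaConsumer<String, String> consumerFor(String bootstrap) {
          Properties props = new Properties();
          props.put("bootstrap.servers", bootstrap);
          props.put("group.id", "storage-loader");
          props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
          KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
          consumer.subscribe(Collections.singletonList("events")); // hypothetical topic
          return consumer;
      }

      public static void main(String[] args) {
          // One consumer per cluster. A message is produced on exactly one
          // cluster, so reading both yields no duplicates and needs no dedup.
          KafkaConsumer<String, String> primary = consumerFor("kafka-primary:9092");
          KafkaConsumer<String, String> backup  = consumerFor("kafka-backup:9092");
          while (true) {
              for (KafkaConsumer<String, String> c : Arrays.asList(primary, backup)) {
                  ConsumerRecords<String, String> records = c.poll(Duration.ofMillis(200));
                  for (ConsumerRecord<String, String> record : records) {
                      // Replace with your real sink (database, processing job, ...).
                      System.out.printf("%s -> %s%n", record.key(), record.value());
                  }
              }
          }
      }
  }

In practice you would probably run each consumer in its own thread or process
so that a slow cluster does not delay the other one, but the point is that
each cluster gets its own consumer instead of a MirrorMaker pipeline.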


On Fri, May 25, 2018 at 9:29 AM Shantanu Deshmukh <shantanu...@gmail.com>
wrote:

> Hi Vincent,
>
> Our producers and consumers are indeed local to the Kafka cluster. When we
> switch DCs, everything switches. So when we are on the backup, producers and
> consumers on the backup DC are active and everything on the primary DC is stopped.
>
> Whatever data gets accumulated on the backup DC needs to be reflected in the
> primary DC. That's when we start reverse replication. And to clean up the data
> replicated from primary to backup (before the switch happened), we have to
> purge topics on the backup Kafka cluster. That is the challenge.
>
> On Fri, May 25, 2018 at 12:40 PM Vincent Maurin <vincent.mau...@glispa.com
> >
> wrote:
>
> > Hi Shantanu
> >
> > I am not sure the scenario you are describing is the best approach. I would
> > rather consider the problem in terms of producers and consumers of the data.
> > It is usually good practice to put your producers local to your Kafka
> > cluster, so in your case I would suggest having producers in both the main
> > and the backup data center / region.
> > Then the question arises for your consumers and eventually the data storage
> > behind them. If it is centralized in one place, it could be better to not
> > use MirrorMaker and to duplicate the consumers instead.
> >
> > So something looking more like a star schema; let me try some ASCII art:
> >
> > Main DC :                Data storage/processing DC :
> > Producer --> Kafka   |    Consumer ---->  Data storage
> >                      |               /->
> > Backup DC :          |              /
> > Producer --> Kafka   |    Consumer /
> >
> > If you have an outage on the main DC, the backup can replace it (maybe just
> > with a DNS switch or similar).
> > If you have an outage on your storage/processing part, messages will simply
> > be stored in Kafka until your consumers are up again (plan enough disk on
> > the Kafka cluster to cover your SLA).
> >
> > Best,
> >
> >
> >
> >
> > On Fri, May 25, 2018 at 9:00 AM Jörn Franke <jornfra...@gmail.com>
> wrote:
> >
> > > Purging will never fully guarantee that data does not get replicated.
> > > There will always be a case (an error during the purge, etc.) where it
> > > still gets replicated. You may reduce the probability, but you can never
> > > rule it out.
> > >
> > > Your application should be able to handle duplicated messages.
> > >
> > > > On 25. May 2018, at 08:54, Shantanu Deshmukh <shantanu...@gmail.com>
> > > wrote:
> > > >
> > > > Hello,
> > > >
> > > > We have cross data center replication. Using Kafka MirrorMaker we
> > > > replicate data from our primary cluster to a backup cluster. The
> > > > problem arises when we start operating from the backup cluster, in
> > > > case of a drill or an actual outage. Data gathered at the backup
> > > > cluster needs to be reverse-replicated to the primary. To do that I
> > > > can only think of two options: 1) use a different consumer group (CG)
> > > > every time for MirrorMaker, or 2) purge topics so that data sent by
> > > > the primary doesn't get replicated back to the primary again during
> > > > reverse replication.
> > > >
> > > > We have opted for purging the Kafka topics which are under
> > > > replication. I use the kafka-topics.sh --alter command to set the
> > > > topic retention to 5 seconds to purge data. But this doesn't seem to
> > > > be a foolproof mechanism. The thread responsible for this runs only
> > > > every minute, and even when it runs, deletion isn't guaranteed because
> > > > multiple conditions apply: a segment should be full, or a certain
> > > > amount of time should have passed, for a new segment to roll. It so
> > > > happened during one such drill to move to the backup cluster that the
> > > > purge command was issued and we waited for 5 minutes, but the data
> > > > still wasn't purged. Because of this we faced data duplication when
> > > > reverse replication started.
> > > >
> > > > Is there a better way to achieve this?
> > >
> >
>
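
(For reference on the retention-based purge discussed in the quoted thread
above: a rough sketch, assuming a hypothetical topic "events" and a
hypothetical bootstrap address, of doing the same thing through the Java
AdminClient instead of kafka-topics.sh. The comments note why data is not
removed right away: the broker only deletes whole segments, and only when the
retention checker runs, so the active segment is not removed until it rolls.)

  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.admin.AdminClient;
  import org.apache.kafka.clients.admin.Config;
  import org.apache.kafka.clients.admin.ConfigEntry;
  import org.apache.kafka.common.config.ConfigResource;

  public class TopicRetentionPurge {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.put("bootstrap.servers", "kafka-backup:9092"); // hypothetical address

          try (AdminClient admin = AdminClient.create(props)) {
              ConfigResource topic =
                      new ConfigResource(ConfigResource.Type.TOPIC, "events"); // hypothetical topic

              // Equivalent to --alter --config retention.ms=5000: data older
              // than 5 seconds becomes eligible for deletion. Deletion still
              // waits for the active segment to roll and for the periodic
              // retention check, which is why a short wait may purge nothing.
              Config retention = new Config(Collections.singletonList(
                      new ConfigEntry("retention.ms", "5000")));
              admin.alterConfigs(Collections.singletonMap(topic, retention)).all().get();
          }
      }
  }

Note that alterConfigs replaces the whole set of topic-level overrides, so
this is only a sketch, and the original retention value has to be restored
afterwards.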
