Re: Kafka Scaling Ideas

Haruki Okada Sun, 20 Dec 2020 19:22:45 -0800

It depends on how you manually commit offsets.
Auto-commit does commits offsets in async manner basically, so as long as
you do manual-commit in the same way,  there should be no much difference.


And, generally offset-commit mode doesn't make much difference in
performance regardless manual/auto or async/sync unless offset-commit
latency takes significant amount in processing time (e.g. you commit
offsets synchronously in every poll() loop).

2020年12月21日(月) 11:08 Yana K <[email protected]>:

> Thank you so much Marina and Haruka.
>
> Marina's response:
> - When you say " if you are sure there is no room for perf optimization of
> the processing itself :" - do you mean code level optimizations? Can you
> please explain?
> - On the second topic you say " I'd say at least 40" - is this based on 12
> million records / hour?
> -  "if you can change the incoming topic" - I don't think it is possible :(
> -  "you could artificially achieve the same by adding one more step
> (service) in your pipeline" - this is the next thing - but I want to be
> sure this will help, given we've to maintain one more layer
>
> Haruka's response:
> - "One possible solution is creating an intermediate topic" - I already did
> it
> - I'll look at Decaton - thx
>
> Is there any thoughts on the auto commit vs manual commit - if it can
> better the performance while consuming?
>
> Yana
>
>
>
> On Sat, Dec 19, 2020 at 7:01 PM Haruki Okada <[email protected]> wrote:
>
> > Hi.
> >
> > Yeah, Spring-Kafka does processing messages sequentially, so the consumer
> > throughput would be capped by database latency per single process.
> > One possible solution is creating an intermediate topic (or altering
> source
> > topic) with much more partitions as Marina suggested.
> >
> > I'd like to suggest another solution, that is multi-threaded processing
> per
> > single partition.
> > Decaton (https://github.com/line/decaton) is a library to achieve it.
> >
> > Also confluent has published a blog post about parallel-consumer (
> >
> >
> https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/
> > )
> > for that purpose, but it seems it's still in the BETA stage.
> >
> > 2020年12月20日(日) 11:41 Marina Popova <[email protected]>:
> >
> > > The way I see it - you can only do a few things - if you are sure there
> > is
> > > no room for perf optimization of the processing itself :
> > > 1. speed up your processing per consumer thread: which you already
> tried
> > > by splitting your logic into a 2-step pipeline instead of 1-step, and
> > > delegating the work of writing to a DB to the second step ( make sure
> > your
> > > second intermediate Kafka topic is created with much more partitions to
> > be
> > > able to parallelize your work much higher - I'd say at least 40)
> > > 2. if you can change the incoming topic - I would create it with many
> > more
> > > partitions as well - say at least 40 or so - to parallelize your first
> > step
> > > service processing more
> > > 3. and if you can't increase partitions for the original topic ) - you
> > > could artificially achieve the same by adding one more step (service)
> in
> > > your pipeline that would just read data from the original 7-partition
> > > topic1 and just push it unchanged into a new topic2 with , say 40
> > > partitions - and then have your other services pick up from this topic2
> > >
> > >
> > > good luck,
> > > Marina
> > >
> > > Sent with ProtonMail Secure Email.
> > >
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Saturday, December 19, 2020 6:46 PM, Yana K <[email protected]>
> > > wrote:
> > >
> > > > Hi
> > > >
> > > > I am new to the Kafka world and running into this scale problem. I
> > > thought
> > > > of reaching out to the community if someone can help.
> > > > So the problem is I am trying to consume from a Kafka topic that can
> > > have a
> > > > peak of 12 million messages/hour. That topic is not under my control
> -
> > it
> > > > has 7 partitions and sending json payload.
> > > > I have written a consumer (I've used Java and Spring-Kafka lib) that
> > will
> > > > read that data, filter it and then load it into a database. I ran
> into
> > a
> > > > huge consumer lag that would take 10-12hours to catch up. I have 7
> > > > instances of my application running to match the 7 partitions and I
> am
> > > > using auto commit. Then I thought of splitting the write logic to a
> > > > separate layer. So now my architecture has a component that reads and
> > > > filters and produces the data to an internal topic (I've done 7
> > > partitions
> > > > but as you see it's under my control). Then a consumer picks up data
> > from
> > > > that topic and writes it to the database. It's better but still it
> > takes
> > > > 3-5hours for the consumer lag to catch up.
> > > > Am I missing something fundamentally? Are there any other ideas for
> > > > optimization that can help overcome this scale challenge. Any pointer
> > and
> > > > article will help too.
> > > >
> > > > Appreciate your help with this.
> > > >
> > > > Thanks
> > > > Yana
> > >
> > >
> > >
> >
> > --
> > ========================
> > Okada Haruki
> > [email protected]
> > ========================
> >
>


-- 
========================
Okada Haruki
[email protected]
========================

Re: Kafka Scaling Ideas

Reply via email to