Thanks Andrew.

I'll give linger.ms a try.

I was testing worst-case scenarios, so linger.ms was set to 0. Also, the
producer was using acks=all, which definitely adds all the producer requests
to the purgatory while they wait to be acknowledged.
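
For reference, here is a minimal sketch of the producer settings under test,
using the standard Apache Kafka Java client (the bootstrap server and topic
name below are placeholders, not the real ones from the load test):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LoadTestProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: the leader only responds after the full in-sync replica set
        // has the record, so slow follower writes keep produce requests parked
        // in the producer purgatory.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // linger.ms=0 was the worst-case setting in the test; raising it slightly
        // (as Andrew suggests) lets the producer batch more per request.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 0);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("loadtest-topic", "key", "value"));
        }
    }
}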

thanks.

On Sat, Mar 4, 2023 at 2:57 PM Andrew Grant <agr...@confluent.io.invalid>
wrote:

> Hey David,
>
> The followers replicate from the leader, and when they do that they write
> to their own local log. For the ceph cluster, it sounds like the followers'
> writes to their local log are slower? Seems like that would make sense if
> those writes are going over the network. This could explain why the leader
> ends up having to wait longer to hear back from the followers before
> sending the produce response, which in turn could explain why the producer
> purgatory is bigger. See the section "Commit time: Replicating the record
> from leader to followers" in
> https://www.confluent.io/blog/configure-kafka-to-minimize-latency/.
>
> To amortize the cost of slower followers you could look into increasing
> linger.ms so that the producer batches a bit more.
>
> Hope that helps a bit.
>
> Andrew
>
> On Mon, Feb 27, 2023 at 3:39 PM David Ballano Fernandez 
> <dfernan...@demonware.net> wrote:
>
> > thank you!
> >
> > On Mon, Feb 27, 2023 at 12:37 PM David Ballano Fernandez <
> > dfernan...@demonware.net> wrote:
> >
> > > Hi guys,
> > >
> > > I am load testing a couple of clusters, one with local SSD disks and
> > > another one with Ceph.
> > >
> > > Both clusters have the same amount of CPU/RAM and they are configured
> > > the same way. I'm sending the same amount of messages and producing
> > > with linger.ms=0 and acks=all.
> > >
> > > Besides seeing higher latencies on Ceph for the most part, compared to
> > > local disk, there is something that I don't understand.
> > >
> > > On the local disk cluster, messages per second matches exactly the
> > > number of requests, but on the Ceph cluster messages do not match the
> > > total produce requests per second.
> > >
> > > The only thing I can find is that the producer purgatory in the Ceph
> > > Kafka cluster has more requests queued up than the local disk one.
> > >
> > > Also, RemoteTimeMs for producers is high, which could explain why there
> > > are more requests in the purgatory.
> > >
> > > To me, this means that the producer is waiting to hear back from all
> > > the acks, which are set to all. But I don't understand why the local
> > > disk Kafka cluster's purgatory queue is way lower.
> > >
> > > Since I don't think disk is used for this, could network saturation
> > > (since Ceph is network storage) be interfering with the producer
> > > waiting for acks? Is there a way to tune the producer purgatory? I did
> > > change num.replica.fetchers, but that only lowered the fetch purgatory.
> > >
> > >
> > >
> > >
> > >
> >
>
>
