Thanks Andrew. I'll give linger.ms a try.
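In case it helps anyone following the thread, here is a rough sketch of how those producer settings map onto the Java client. It's just an illustration, not the exact test harness: the broker address, topic name, and the linger.ms value are placeholders, not tuned numbers from the test.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSettingsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address and serializers.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // acks=all: the leader waits for the full in-sync replica set before
        // responding, which is what parks produce requests in the producer
        // purgatory until the followers catch up.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        // linger.ms=0 was the worst-case setting in the test; a small positive
        // value lets the producer batch records and amortize the cost of the
        // slower follower writes. 5 ms is just an illustrative guess.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "5");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical topic, key, and value.
            producer.send(new ProducerRecord<>("test-topic", "key", "value"));
        }
    }
}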
I was testing worst-case scenarios, so linger.ms was set to 0. Also, the producer was using acks=all, which definitely adds all the produce requests to the purgatory while they wait to be acknowledged. Thanks.

On Sat, Mar 4, 2023 at 2:57 PM Andrew Grant <agr...@confluent.io.invalid> wrote:

> Hey David,
>
> The followers replicate from the leader, and when they do that they write
> to their own local log. For the ceph cluster, it sounds like the followers'
> writes to their local log are slower? Seems like that would make sense if
> those writes are going over the network. This could explain why the leader
> ends up having to wait longer to hear back from the followers before
> sending the produce response, which in turn could explain why the producer
> purgatory is bigger. See the section "Commit time: Replicating the record
> from leader to followers" in
> https://www.confluent.io/blog/configure-kafka-to-minimize-latency/.
>
> To amortize the cost of slower followers you could look into increasing
> linger.ms so that the producer batches a bit more.
>
> Hope that helps a bit.
>
> Andrew
>
> On Mon, Feb 27, 2023 at 3:39 PM David Ballano Fernandez
> <dfernan...@demonware.net> wrote:
>
> > thank you!
> >
> > On Mon, Feb 27, 2023 at 12:37 PM David Ballano Fernandez <
> > dfernan...@demonware.net> wrote:
> >
> > > Hi guys,
> > >
> > > I am load-testing a couple of clusters, one with local SSD disks and
> > > another one with Ceph.
> > >
> > > Both clusters have the same amount of CPU/RAM and they are configured
> > > the same way. I'm sending the same number of messages and producing
> > > with linger.ms=0 and acks=all.
> > >
> > > Besides seeing higher latencies on Ceph for the most part compared to
> > > local disk, there is something that I don't understand.
> > >
> > > On the local-disk cluster, messages per second match exactly the
> > > number of produce requests, but on the Ceph cluster messages do not
> > > match total produce requests per second.
> > >
> > > The only thing I can find is that the producer purgatory on the Ceph
> > > Kafka cluster has more requests queued up than on the local-disk
> > > cluster.
> > >
> > > Also, RemoteTime-ms for producers is high, which could explain why
> > > there are more requests in the purgatory.
> > >
> > > To me, this means the producer is waiting to hear back from all the
> > > acks, which are set to all. But I don't understand why the local-disk
> > > Kafka cluster's purgatory queue is way lower, since I don't think disk
> > > is used for this. Could network saturation, since Ceph is network
> > > storage, be interfering with the producer waiting for acks? Is there a
> > > way to tune the producer purgatory? I did change num.replica.fetchers,
> > > but that only lowered the fetch purgatory.
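P.S. For completeness, here is a rough sketch of the broker-side knob and the metrics discussed in the thread. The num.replica.fetchers value is illustrative, not a recommendation, and the JMX names are as I understand them:

# server.properties (broker side)
# More replica fetcher threads; in my tests this mainly lowered the fetch
# purgatory, not the produce purgatory.
num.replica.fetchers=4

# JMX metrics referenced above
kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce
kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce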