Hi Ben,

Thanks for the quick reply. As you mentioned, we are seeing the per-broker write 
throughput capped at 250 MB/s. 

We tried multiple scenarios, as described below:

1) Msg size 1 KB, 3 publishers, 3 brokers, 3 partitions, RF = 2 
In this case, as mentioned in my earlier email, total throughput with replication 
on was never more than 90 MB/s for the cluster. So each broker was writing 60 MB/s 
(30 MB/s as leader + 30 MB/s as follower replica). That is just one quarter (25%) 
of the 250 MB/s EBS throughput limit. Without replication, however, we were able 
to get 126 MB/s per broker and a total throughput of 380 MB/s. The question is why 
total throughput drops so much (from 380 to 90 MB/s) when replication is on. 
Why does replication slow each broker down to 25% of its capacity?
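
(For reference, the test topic for this scenario is created along the lines of the 
Java sketch below; the bootstrap endpoint and topic name are placeholders.)

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Properties;

    public class CreateScenario1Topic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder endpoint; the real MSK bootstrap string goes here.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                      "b-1.msk.example:9094");
            try (AdminClient admin = AdminClient.create(props)) {
                // Scenario 1: one topic, 3 partitions, replication factor 2.
                NewTopic topic = new NewTopic("ingest-test", 3, (short) 2);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }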

2) Msg size 1 KB, 6 publishers, 6 brokers, 6 partitions, RF = 1 
In this case throughput without replication (RF = 1) was 1487 MB/s total, i.e. 
about 250 MB/s per broker. At this point EBS must be reaching its upper limit. Is 
there a way for us to choose "Provisioned (high) IOPS" volumes? I suspect the disk 
is the bottleneck in this case, but how do we know the network is not? The EC2 
instance types used for the publishers (m5.2xlarge, 8 cores) and brokers 
(m5.large, 2 cores) both list the network bandwidth as 10 Gbps = 1.25 GB/s. Is it 
possible to get EBS volumes with higher I/O throughput?
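
As a quick back-of-envelope check on the disk-vs-network question (only a sketch; 
the EBS and NIC figures below are the documented caps, not measurements):

    // Rough bottleneck check for scenario 2.
    public class BottleneckCheck {
        public static void main(String[] args) {
            double ingressMBps = 1487.0; // measured total producer ingress
            int brokers = 6;
            int rf = 1;                  // replication factor in this run

            // With rf copies of every byte spread over the brokers, each broker
            // receives and writes roughly this much per second:
            double perBroker = ingressMBps * rf / brokers;

            double ebsCap = 250.0;  // gp2 max write throughput per volume (MiB/s)
            double nicCap = 1250.0; // advertised 10 Gbps ~= 1.25 GB/s

            System.out.printf("per-broker write: %.0f MB/s (EBS cap %.0f, NIC cap %.0f)%n",
                    perBroker, ebsCap, nicCap);
        }
    }

With these numbers each broker sits at roughly 248 MB/s, right at the gp2 ceiling 
but only about 20% of the advertised network bandwidth, which is why I suspect the 
disk rather than the network.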

Thanks,
Arti Pande

On 2020/09/09 15:26:22, Ben Weintraub <benwe...@gmail.com> wrote: 
> Hi Arti,
> 
> From your description, it sounds like your per-partition throughput is
> extremely high (380 MB/s across three partitions is ~125 MB/s per
> partition). A more typical high-end range would be 5-10 MB/s per partition.
> Each follower gets mapped to a single replica fetcher thread on the
> follower side, which maintains a single TCP connection to the leader, and
> that connection is in turn mapped to a single network processor thread on
> the leader side. With the setup you described, you have 3 [partitions] * (2 - 1)
> [replicas per partition, minus one for the leader] = 3 followers across the
> whole cluster, but 10 network threads per
> broker. In this scenario, you're likely not actually spreading the
> replication traffic across those multiple network processor threads. This
> means that you're probably pinning a single core per broker doing TLS for
> replication, while the others are mostly idle (or servicing client
> requests).
> 
> Bumping your partition count will likely allow you to spread replication
> load across more of your network processor threads, and achieve higher CPU
> utilization.
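> 
> As a concrete sketch (topic name, endpoint, and target count are placeholders),
> the partition count can be raised with the Java admin client:
> 
>     import org.apache.kafka.clients.admin.AdminClient;
>     import org.apache.kafka.clients.admin.AdminClientConfig;
>     import org.apache.kafka.clients.admin.NewPartitions;
>     import java.util.Collections;
>     import java.util.Properties;
> 
>     public class BumpPartitions {
>         public static void main(String[] args) throws Exception {
>             Properties props = new Properties();
>             props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
>                       "b-1.msk.example:9094"); // placeholder bootstrap string
>             try (AdminClient admin = AdminClient.create(props)) {
>                 // Grow the topic from 3 to e.g. 24 partitions so replication
>                 // traffic fans out across more fetcher/network threads.
>                 admin.createPartitions(Collections.singletonMap(
>                         "ingest-test", NewPartitions.increaseTo(24))).all().get();
>             }
>         }
>     }
> 
> Note that adding partitions doesn't move existing data, so for a clean benchmark
> it may be simpler to recreate the topic with the higher partition count.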
> 
> In addition, it's important to consider not only disk usage but disk write
> throughput. With 380 MB/s of ingress traffic and an RF of 2, you'll need a
> total of 380 * 2 = 760 MB/s of disk write throughput, or about 250 MB/s per
> broker. Coincidentally (or not), MSK brokers use gp2 EBS volumes
> <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html>
> for storing the Kafka log directory, which means you're limited to an
> absolute max of 250 MiB/s of write throughput per broker. You probably want
> to allow yourself some headroom here too, so I'd suggest that you start
> from your desired ingress throughput and replication factor, plus your
> desired safety margin, and then work backwards to figure out how many gp2
> volumes you'll need to achieve that level of write throughput safely. Also
> note that while CPU and RAM scale with the larger MSK instance types, disk
> throughput does not.
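> 
> For example, that working-backwards exercise might look like the sketch below
> (the target ingress and safety margin are made-up example values):
> 
>     public class SizeGp2Volumes {
>         public static void main(String[] args) {
>             double targetIngressMBps = 380.0; // desired producer ingress (example)
>             int    replicationFactor = 2;
>             double safetyMargin      = 1.5;   // e.g. keep 50% headroom
>             int    brokers           = 3;
>             double gp2CapMBps        = 250.0; // max write throughput per gp2 volume
> 
>             double clusterWrite   = targetIngressMBps * replicationFactor * safetyMargin;
>             double perBrokerWrite = clusterWrite / brokers;
>             int volumesPerBroker  = (int) Math.ceil(perBrokerWrite / gp2CapMBps);
> 
>             System.out.printf("cluster write %.0f MB/s, per broker %.0f MB/s, "
>                     + "gp2 volumes per broker: %d%n",
>                     clusterWrite, perBrokerWrite, volumesPerBroker);
>         }
>     }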
> 
> Cheers,
> Ben
> 
> On Wed, Sep 9, 2020 at 7:43 AM Arti Pande <pande.a...@gmail.com> wrote:
> 
> > Hi,
> >
> > We have been evaluating Managed Streaming for Kafka (MSK) on AWS for a
> > use-case that requires high-speed data ingestion of the order of millions
> > of messages (each ~1 KB size) per second. We ran into some issues when
> > testing this case.
> >
> > Context:
> > To start with, we have set up a single topic with 3 partitions on a 3-node
> > MSK of m5.large (2 cores, 8 GB RAM, 500 GB EBS) with encryption enabled for
> > inter-broker (intra-MSK) communication. Each broker is in a separate AZ
> > (total 3 AZs and 3 brokers) and has 10 network threads and 16 IO threads.
> >
> > When the topic has replication-factor = 2 and min.insync.replicas = 2, and the
> > publishers use acks = all, sending 100+ million messages with 3 parallel
> > publishers intermittently results in the following error:
> >           `Delivery failed: Broker: Not enough in-sync replicas`
> > As per the documentation, this error is thrown when in-sync replicas lag
> > behind for more than a configured duration (replica.lag.time.max.ms, 30
> > seconds by default).
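> >
> > (For context, the publishers are configured along the lines of the Java sketch
> > below; the bootstrap string and topic name are placeholders, and the same acks
> > setting applies whatever client library is actually used.)
> >
> >     import org.apache.kafka.clients.producer.KafkaProducer;
> >     import org.apache.kafka.clients.producer.ProducerConfig;
> >     import org.apache.kafka.clients.producer.ProducerRecord;
> >     import org.apache.kafka.common.serialization.ByteArraySerializer;
> >     import java.util.Properties;
> >
> >     public class AckAllPublisher {
> >         public static void main(String[] args) {
> >             Properties props = new Properties();
> >             props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
> >                       "b-1.msk.example:9094"); // placeholder
> >             // acks=all: the leader waits for the full in-sync replica set,
> >             // subject to the topic's min.insync.replicas.
> >             props.put(ProducerConfig.ACKS_CONFIG, "all");
> >             props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
> >                       ByteArraySerializer.class.getName());
> >             props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
> >                       ByteArraySerializer.class.getName());
> >             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
> >             byte[] payload = new byte[1024]; // ~1 KB message, as in the test
> >             producer.send(new ProducerRecord<>("ingest-test", payload));
> >             producer.close(); // flush any outstanding sends
> >         }
> >     }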
> >
> > However, even when we don't see this error, the throughput is only around
> > 90 K msgs/sec, i.e. 90 MB/sec. CPU usage is below 50% and disk usage is below
> > 20%, so apparently CPU/memory/disk are not the issue?
> >
> > If we change replication-factor = 1 and min.insync.replicas = 1 and/or
> > acks = 1 and keep all other settings the same, then there are no errors and
> > throughput is ~380 K msgs/sec, i.e. 380 MB/sec. CPU usage was below 30%.
> >
> > Question:
> > Without replication we were able to get 380 MB/sec written, so presumably
> > disk, CPU and memory are not the issue. What could cause the replicas to lag
> > behind at 90 MB/sec throughput? Is the total number of threads (10 network +
> > 16 IO) too high for a 2-core machine? But then the same thread settings work
> > fine without replication. What could be the reason for (1) lower throughput
> > when replication is turned on and (2) replicas lagging behind when
> > replication is turned on?
> >
> > Thanks
> > Arti
> >
> 
