Hi Greg,

Thank you very much for the quick and detailed response. Our clients are on 
2.5.0, so they do have the problematic version of the partitioner. From the 
metrics we have available, it's not 100% clear that this is the issue, but 
some large components were restarted at the same time the problem started, so 
it's certainly plausible that this temporarily affected the node that went 
bad. We'll plan to upgrade the clients and hope that solves it.

Thanks!
Meg

-----Original Message-----
From: Greg Harris <greg.har...@aiven.io.INVALID> 
Sent: Friday, March 24, 2023 5:13 PM
To: users@kafka.apache.org
Subject: Re: Sudden imbalance between partitions

Meg,

What version are your clients, and what partitioner are you using for these 
records?

If you're using the DefaultPartitioner from 2.4.0+, it has a known imbalance 
flaw that is described and addressed by this KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
which was released in 3.3.0.
To make sure you're using the patched partitioner, the clients jar should be 
on 3.3.0+ and your application should not set the `partitioner.class` 
configuration; leaving it unset lets the producer choose the behavior.
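
For illustration, a minimal producer setup that would pick up the patched 
behavior might look like the sketch below (the class name, bootstrap address, 
and topic name are placeholders of mine, not from your setup); the key point 
is that `partitioner.class` is never set:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class StickySendExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                      StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                      StringSerializer.class.getName());
            // partitioner.class is deliberately left unset: on kafka-clients
            // 3.3.0+ the producer then applies the KIP-794 uniform sticky
            // logic to records with a null key.
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", null, "value")); // null key
            }
        }
    }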

In the short term, pausing, throttling, or restarting producers may help 
resolve the imbalance, since the poor balance is caused by the state of the 
producer buffers.
Adding nodes to the cluster and spreading partitions more thinly may also 
give each broker more headroom before the imbalance becomes a problem.
However, this will not solve the problem on its own, and may make it 
temporarily worse while partitions are being replicated to the added nodes.
If you're already running the patched version of the partitioner, then a more 
detailed investigation will be necessary.
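
If it comes to that deeper look, here is a rough sketch of how you could 
quantify which partitions are growing fastest by sampling end offsets twice 
with a plain Java consumer (the class name, bootstrap address, topic name, 
and sampling interval are all assumptions of mine, not from your setup):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class PartitionGrowthCheck {
        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092"); // placeholder
            props.put("key.deserializer", ByteArrayDeserializer.class.getName());
            props.put("value.deserializer", ByteArrayDeserializer.class.getName());
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                List<TopicPartition> parts = consumer.partitionsFor("my-topic").stream()
                        .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
                        .collect(Collectors.toList());
                Map<TopicPartition, Long> before = consumer.endOffsets(parts);
                Thread.sleep(60_000); // one-minute sample interval
                Map<TopicPartition, Long> after = consumer.endOffsets(parts);
                // Records appended per partition during the interval; the
                // partitions led by the hot broker should stand out.
                for (TopicPartition tp : parts) {
                    System.out.println(tp + ": " + (after.get(tp) - before.get(tp)));
                }
            }
        }
    }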

I hope some of this helps!
Greg Harris

On Fri, Mar 24, 2023 at 11:57 AM Margaret Figura 
<margaret.fig...@infovista.com.invalid> wrote:

> Hi,
>
> We have a 22-node Kafka 3.3.1 cluster on K8s. All data is sent with a 
> null partition and a null key from 20 Java producers, so it should be 
> distributed evenly across partitions. All was good for days, but a 
> couple of hours ago broker 21 started receiving about 2x the data of 
> the other brokers for a few topics (but not all). These topics all 
> have replication factor 1, and their 96 partitions are distributed 
> evenly across brokers (each broker has 4 or 5 partitions). This was 
> detected in Grafana, but I can also see the offsets increasing much 
> faster for the partitions owned by broker 21 via GetOffsetShell. What 
> could cause this? I didn't see anything unusual in the broker 21 logs 
> or the controller logs.
>
> Looking back, I noticed that broker 11 also becomes a bit unbalanced 
> each day at the time when we are processing the most data, but it is 
> only 10-15% higher than the others. All other brokers are quite even, 
> including broker 21 until today.
>
> Any ideas on what I can check? Unfortunately we'll probably have to 
> restart Kafka and/or the producers pretty soon.
>
> Thanks a lot!
> Meg
>
