Hi,

Not sure if this helps, but Loggly seems to handle this by using a
separate topic for "noisy neighbors". See [1].

[1]
https://www.loggly.com/blog/loggly-loves-apache-kafka-use-unbreakable-messaging-better-log-management/
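
A minimal sketch of that idea, in case it is useful. Everything here is an
assumption for illustration, not Loggly's actual implementation: the class
name SkewAwareRouter, the topic names, and the hot-key set (which I assume
you would build from your offline key statistics). It sends known-heavy keys
to a separate topic and spreads them round-robin over its partitions, while
all other keys keep the usual hash-based placement. String.hashCode() is only
a stand-in for whatever hash your producer really uses.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: chooses a topic and partition for a record key. */
final class SkewAwareRouter {
    // Keys known to be heavy, assumed to come from offline statistics.
    private final Set<String> hotKeys;
    private final String normalTopic;
    private final String noisyTopic;
    // Shared counter used to spread hot keys evenly over partitions.
    private final AtomicInteger roundRobin = new AtomicInteger();

    SkewAwareRouter(Set<String> hotKeys, String normalTopic, String noisyTopic) {
        this.hotKeys = new HashSet<>(hotKeys);
        this.normalTopic = normalTopic;
        this.noisyTopic = noisyTopic;
    }

    /** Heavy keys go to the "noisy neighbor" topic, everything else stays put. */
    String topicFor(String key) {
        return hotKeys.contains(key) ? noisyTopic : normalTopic;
    }

    /** Returns a partition in [0, numPartitions). */
    int partitionFor(String key, int numPartitions) {
        if (hotKeys.contains(key)) {
            // Round-robin: balances load but gives up per-key ordering.
            return Math.floorMod(roundRobin.getAndIncrement(), numPartitions);
        }
        // Plain hash placement: same key always lands on the same partition.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

You would call topicFor() and partitionFor() before building each record and
pass the results to the producer explicitly. Note the trade-off: records for
a hot key no longer land on a single partition, so consumers cannot rely on
per-key ordering for those clients.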

Cheers,
Jens

On Wed, Apr 27, 2016 at 9:11 PM Srikanth <srikanth...@gmail.com> wrote:

> Hello,
>
> Is there a recommendation for handling producer side partitioning based on
> a key with skew?
> We want to partition on something like clientId. Problem is, this key has
> a non-uniform distribution.
> It's equally likely to see a key with 3k occurrences/day vs 100k/day vs
> 65 million/day.
> Cardinality of key is around 1500 and there are approx 1 billion records
> per day.
> Partitioning by hashcode(key) % numOfPartition will create a few "hot
> partitions" and cause a few brokers (and consumer threads) to be overloaded.
> Maybe these partitions with heavy load are evenly distributed among
> brokers, maybe they are not.
>
> I read KIP-22
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-22+-+Expose+a+Partitioner+interface+in+the+new+producer>
> that explains how one could write a custom partitioner.
> I'd like to know how it was used to solve such data skew.
> We can compute some statistics on key distribution offline and use it in
> the partitioner.
> Is that a good idea? Or is it way too much logic for a partitioner?
> Anything else to consider?
> Any thoughts or reference will be helpful.
>
> Thanks,
> Srikanth
>
-- 

Jens Rantil
Backend Developer @ Tink

Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
For urgent matters you can reach me at +46-708-84 18 32.
