Hi,

Not sure if this helps, but the way Loggly seems to handle it is to use a separate topic for "noisy neighbors". See [1].
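A minimal sketch of the separate-topic idea, in plain Python with no Kafka client involved. The topic names, the threshold, and both helper functions are hypothetical and only for illustration; this is not taken from Loggly's post and may not match how they implement it. It assumes per-key daily counts are available from an offline job, as the question suggests.

```python
import zlib

# Hypothetical topic names, chosen for illustration only.
MAIN_TOPIC = "events"
NOISY_TOPIC = "events-noisy"


def find_noisy_keys(daily_counts, threshold):
    """Return the keys whose offline daily count exceeds the threshold."""
    return {key for key, count in daily_counts.items() if count > threshold}


def route(key, noisy_keys, main_partitions, noisy_partitions, sequence):
    """Pick (topic, partition) for one record.

    Normal keys use a stable hash, so every record for a key lands on the
    same partition. Noisy keys are spread round-robin over a separate
    topic, which gives up per-key ordering for those keys.
    """
    if key in noisy_keys:
        return NOISY_TOPIC, sequence % noisy_partitions
    # crc32 is stable across processes, unlike Python's built-in hash() on str.
    return MAIN_TOPIC, zlib.crc32(key.encode("utf-8")) % main_partitions


# Example with the three volumes mentioned in the question.
counts = {"client-a": 3_000, "client-b": 100_000, "client-c": 65_000_000}
noisy = find_noisy_keys(counts, threshold=10_000_000)
print(route("client-c", noisy, 12, 6, sequence=7))  # ('events-noisy', 1)
```

The trade-off in this sketch is that consumers of the noisy topic can no longer rely on seeing one client's records in order on a single partition, so it only fits if that guarantee is not needed for the heavy keys.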
[1] https://www.loggly.com/blog/loggly-loves-apache-kafka-use-unbreakable-messaging-better-log-management/

Cheers,
Jens

On Wed, Apr 27, 2016 at 9:11 PM Srikanth <srikanth...@gmail.com> wrote:

> Hello,
>
> Is there a recommendation for handling producer-side partitioning based on
> a key with skew?
> We want to partition on something like clientId. The problem is, this key
> has a non-uniform distribution.
> It's equally likely to see a key with 3k occurrences/day vs 100k/day vs
> 65 million/day.
> Cardinality of the key is around 1500 and there are approx. 1 billion
> records per day.
> Partitioning by hashcode(key) % numOfPartition will create a few "hot
> partitions" and cause a few brokers (and consumer threads) to be overloaded.
> Maybe these partitions with heavy load are evenly distributed among
> brokers, maybe they are not.
>
> I read KIP-22
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-22+-+Expose+a+Partitioner+interface+in+the+new+producer>
> that explains how one could write a custom partitioner.
> I'd like to know how it was used to solve such data skew.
> We can compute some statistics on key distribution offline and use them in
> the partitioner.
> Is that a good idea? Or is it way too much logic for a partitioner?
> Anything else to consider?
> Any thoughts or references will be helpful.
>
> Thanks,
> Srikanth

--
Jens Rantil
Backend Developer @ Tink