Yeah I agree we could have handled this better. I think the story we have
now is that you can override it using the partition argument in the
producer (and when we get the patch for pluggable producer we can bundle a
LegacyPartitioner or something like that).
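
To make the override concrete, here is a rough sketch of computing the old
hashCode-based partition yourself. The names (LegacyPartitioning, safeAbs,
legacyPartition) are made up for illustration, and safeAbs is only my
assumption of what Utils.abs does (absolute value, with Integer.MIN_VALUE
mapped to 0). It has no Kafka dependency; the number it returns is what you
would hand to the producer as the partition argument.

```java
public final class LegacyPartitioning {
    private LegacyPartitioning() {}

    // Assumption: stand-in for Utils.abs. Math.abs(Integer.MIN_VALUE) is
    // still negative, so that one value is mapped to 0 explicitly.
    public static int safeAbs(int n) {
        return (n == Integer.MIN_VALUE) ? 0 : Math.abs(n);
    }

    // Mirrors the old formula quoted below: abs(key.hashCode) % numPartitions.
    public static int legacyPartition(Object key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        return safeAbs(key.hashCode()) % numPartitions;
    }

    public static void main(String[] args) {
        String key = "user-42"; // hypothetical key
        int partition = legacyPartition(key, 8);
        System.out.println("key " + key + " -> partition " + partition);
        // With the new producer, this value would be passed as the explicit
        // partition, e.g. new ProducerRecord<>(topic, partition, key, value).
    }
}
```

This only reproduces the old placement if key.hashCode() here is computed on
the same kind of object the old producer was hashing.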

The reason for murmur2 over murmur3 was that it had a good single-class Java
implementation. The only murmur3 implementation I could find was extremely
complex and hard to bundle, and I really wanted to avoid depending on
something like Guava, which ends up being kind of a nightmare from a
dependency management perspective for client libs.
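
For a sense of what "single-class" means here, below is a sketch of 32-bit
MurmurHash2 over a byte array, written from the commonly published algorithm.
The class name Murmur2Sketch, the partitionFor helper, and the seed value are
assumptions for illustration; I am not claiming it is byte-for-byte what the
client ships.

```java
import java.nio.charset.StandardCharsets;

public final class Murmur2Sketch {
    private Murmur2Sketch() {}

    // 32-bit MurmurHash2. The seed is an assumption chosen for this sketch.
    public static int hash(byte[] data) {
        final int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;
        int length = data.length;
        int h = seed ^ length;

        // Mix four bytes at a time, read as a little-endian int.
        for (int i = 0; i + 4 <= length; i += 4) {
            int k = (data[i] & 0xff)
                  | ((data[i + 1] & 0xff) << 8)
                  | ((data[i + 2] & 0xff) << 16)
                  | ((data[i + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }

        // Fold in the remaining 1 to 3 bytes; the fall-through is intentional.
        int tail = length & ~3;
        switch (length % 4) {
            case 3: h ^= (data[tail + 2] & 0xff) << 16;
            case 2: h ^= (data[tail + 1] & 0xff) << 8;
            case 1: h ^= data[tail] & 0xff;
                    h *= m;
        }

        // Final avalanche.
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    // Same shape as the new partitioner's abs(murmur2(key)) % numPartitions.
    public static int partitionFor(byte[] key, int numPartitions) {
        int h = hash(key);
        int positive = (h == Integer.MIN_VALUE) ? 0 : Math.abs(h);
        return positive % numPartitions;
    }

    public static void main(String[] args) {
        byte[] key = "user-42".getBytes(StandardCharsets.UTF_8); // hypothetical key
        System.out.println("partition " + partitionFor(key, 8));
    }
}
```

Because it works on raw bytes and nothing JVM-specific, the same function can
be ported to non-Java producers.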

-Jay

On Sun, Apr 26, 2015 at 9:03 PM, Gwen Shapira <gshap...@cloudera.com> wrote:

> Definitely +1 for advertising this in the docs.
>
> What I can't figure out is the upgrade path... if my application assumes
> that all data for a single user is in one partition (so it subscribes to a
> single partition and expects everything about a specific subset of users to
> be in that partition), this assumption will not survive an upgrade to
> 0.8.2.X.  I think the assumption of stable hash partitions even after
> upgrades is pretty reasonable (i.e. I made it about a gazillion times without
> thinking twice). Note that in this story my app wasn't even upgraded - it
> broke because a producer upgraded to a new API.
>
> If we advertise: "upgrading to the new producer API may break consumers",
> we may need to offer a work-around to allow people to upgrade producers
> anyway.
> Perhaps we can say "wait for Sriharsha's partitioner patch and write a
> custom partitioner that uses hashCode()".
>
> Gwen
>
>
>
> On Sun, Apr 26, 2015 at 7:57 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> > This was actually intentional.
> >
> > The problem with relying on hashCode is that
> > (1) it is often a very bad hash function,
> > (2) it is not guaranteed to be consistent from run to run (i.e. if you
> > restart the jvm the value of hashing the same key can change!),
> > (3) it is not available outside the jvm so non-java producers can't use the
> > same function.
> >
> > In general at the moment different producers don't use the same hash code
> > so I think this is not quite as bad as it sounds. Though it would be good
> > to standardize things.
> >
> > I think the most obvious thing we could do here would be to do a much
> > better job of advertising this in the docs, though, so people don't get
> > bitten by it.
> >
> > -Jay
> >
> > On Fri, Apr 24, 2015 at 5:48 PM, James Cheng <jch...@tivo.com> wrote:
> >
> > > Hi,
> > >
> > > I was playing with the new producer in 0.8.2.1 using partition keys
> > > ("semantic partitioning" I believe is the phrase?). I noticed that the
> > > default partitioner in 0.8.2.1 does not partition items the same way as the
> > > old 0.8.1.1 default partitioner was doing. For a test item, the old
> > > producer was sending it to partition 0, whereas the new producer was
> > > sending it to partition 4.
> > >
> > > Digging in the code, it appears that the partitioning logic is different
> > > between the old and new producers. Both of them hash the key, but they use
> > > different hashing algorithms.
> > >
> > > Old partitioner:
> > > ./core/src/main/scala/kafka/producer/DefaultPartitioner.scala:
> > >
> > >   def partition(key: Any, numPartitions: Int): Int = {
> > >     Utils.abs(key.hashCode) % numPartitions
> > >   }
> > >
> > > New partitioner:
> > > ./clients/src/main/java/org/apache/kafka/clients/producer/internals/Partitioner.java:
> > >
> > >         } else {
> > >             // hash the key to choose a partition
> > >             return Utils.abs(Utils.murmur2(record.key())) % numPartitions;
> > >         }
> > >
> > > Where murmur2 is a custom hashing algorithm. (I'm assuming that murmur2
> > > isn't the same logic as hashCode, especially since hashCode is
> > > overrideable).
> > >
> > > Was it intentional that the hashing algorithm would change between the old
> > > and new producer? If so, was this documented? I don't know if anyone was
> > > relying on the old default partitioner, as opposed to going round-robin or
> > > using their own custom partitioner. Do you expect it to change in the
> > > future? I'm guessing that one of the main reasons to have a custom hashing
> > > algorithm is so that you are in full control of the partitioning and can keep
> > > it stable (as opposed to being reliant on hashCode()).
> > >
> > > Thanks,
> > > -James
> > >
> > >
> >
>
