Hi Michael-Keith,

Good question. Two answers: in the default case the same key (e.g., "world") ends up in the same partition, so you wouldn't see the situation you describe, where the same key appears in two different partitions of the same topic. This default case applies, e.g., if you are writing to Kafka using Kafka Connect etc. In this default case, no further reduce is needed, since each partition holds a disjoint set of keys.
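To make the default case concrete, here is a small self-contained Java sketch (it does not use the Kafka APIs; the hash-mod routing below is a simplification of Kafka's real default partitioner, which hashes keys with murmur2). The point it illustrates is that because routing is a pure function of the key, the per-partition key sets are disjoint, so the global result is just the union of the partition-local counts:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionByKeyDemo {

    // Simplified key-based routing: hash of the key modulo the partition count.
    // (Kafka's default partitioner uses murmur2, but the property that matters
    // is the same: the partition is a pure function of the key.)
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    // Count words per partition, then union the partition-local results.
    static Map<String, Long> countPartitioned(List<String> words, int numPartitions) {
        Map<Integer, Map<String, Long>> perPartition = new HashMap<>();
        for (String w : words) {
            perPartition
                .computeIfAbsent(partitionFor(w, numPartitions), p -> new HashMap<>())
                .merge(w, 1L, Long::sum);
        }
        // Equal keys always route to the same partition, so the partitions'
        // key sets are disjoint and the union below never has to combine two
        // partial counts for the same word.
        Map<String, Long> global = new HashMap<>();
        perPartition.values().forEach(m ->
            m.forEach((k, v) -> global.merge(k, v, Long::sum)));
        return global;
    }

    public static void main(String[] args) {
        // Words from "hello world" and "goodbye world".
        List<String> words = Arrays.asList("hello", "world", "goodbye", "world");
        System.out.println(countPartitioned(words, 2));
    }
}
```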
However, if you really are stuck with the same key in two different partitions, then after the per-partition summing you'd have to repartition the keys from each of the 2 partitions and write them to an intermediate topic, e.g., using ".through()". After that, you'd have to aggregate again to compute the final sum.

Thanks,
Eno

> On 20 Jul 2016, at 11:54, Michael-Keith Bernard <mkbern...@opentable.com> wrote:
>
> Hello Kafka Users,
>
> I've been floating this question around the #apache-kafka IRC channel on Freenode for the last week or two, and I still haven't reached a satisfying answer. The basic question is: how does Kafka Streams merge partial results? Let me expand on that a bit...
>
> Consider the following word count example in the official Kafka Streams repo (GitHub mirror):
> https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/wordcount/WordCountDemo.java#L46
>
> Now suppose we run this topology against a Kafka topic with 2 partitions. Into the first partition, we insert the sentence "hello world". Into the second partition, we insert the sentence "goodbye world". We know a priori that the result of this computation is something like:
>
> { hello: 1, goodbye: 1, world: 2 } # made-up syntax for a compacted log/KTable state
>
> And indeed we'd probably see precisely that result from a *single* consumer process that sees *all* the data. However, my question is: what if I have 1 consumer per topic partition (2 consumers total in the above hypothetical)? Under that scenario, consumer 1 would emit { hello: 1, world: 1 } and consumer 2 would emit { goodbye: 1, world: 1 }... But the true answer requires an additional reduction over duplicate keys (in this case with a sum operator, though that needn't be the case for arbitrary aggregations).
>
> So again my question is: how are the partial results that each consumer generates merged into a final result?
> A simple way to accomplish this would be to produce an intermediate topic that is keyed by the word, then aggregate that (since each consumer would then see all the data for a given key), but if that's happening, it's not explicit anywhere in the example. So what mechanism is Kafka Streams using internally to aggregate the results of a partitioned stream across multiple consumers? (Perhaps groupByKey creating an anonymous intermediate topic?)
>
> I know that's a bit wordy, but I want to make sure my question is extremely clear. If I've still fumbled on that, let me know and I will try to be even more explicit. :)
>
> Cheers,
> Michael-Keith Bernard
>
> P.S. Kafka is awesome and Kafka Streams looks even awesome-er!