Just a few additions to what Eno said: if you are using the Streams DSL to write your application, and you are aggregating on something other than the current key of the input stream, the repartitioning is performed automatically through an auto-generated internal repartition topic. So you are guaranteed that the final results will never have the same aggregation key sitting on more than one instance, and hence no follow-up reduce operation is needed.
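To make that concrete, here is a minimal sketch of the word-count topology in the DSL (a sketch only: it assumes the StreamsBuilder flavor of the API rather than the exact API in the linked demo, and the topic names, application id, and store/class names are made-up placeholders). The groupBy() call re-keys each record by the word, which marks the stream for repartitioning; the count() that follows then runs behind the internal repartition topic:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class WordCountSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch");   // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> lines = builder.stream("sentences"); // placeholder topic

            KTable<String, Long> counts = lines
                // Split each sentence into words; the record key is unchanged here.
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                // Re-key each record by the word itself. Because the key changes,
                // the stream is marked for repartitioning.
                .groupBy((key, word) -> word)
                // count() forces the repartitioning to happen: an internal
                // "<app-id>-...-repartition" topic is created, so all occurrences
                // of the same word are routed to the same task before aggregation.
                .count();

            counts.toStream().to("word-counts",
                Produced.with(Serdes.String(), Serdes.Long())); // placeholder topic

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

With the two-partition input from Michael-Keith's example, "hello world" and "goodbye world" are each split and re-keyed, both "world" records land in the same partition of the repartition topic, and "world" is therefore counted by a single task, giving { hello: 1, goodbye: 1, world: 2 } with no manual merge step.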
Guozhang

On Sun, Jul 24, 2016 at 7:39 PM, Eno Thereska <eno.there...@gmail.com> wrote:

> Hi Michael-Keith,
>
> Good question. Two answers: in the default case the same key (e.g.,
> "world") would end up in the same partition, so you wouldn't have the
> example you describe here where the same key is in two different
> partitions of the same topic. E.g., this default case applies if you are
> writing to Kafka using Kafka Connect etc. In this default case, no further
> reduce is needed since each partition would have unique keys.
>
> However, if you are really stuck with the same key in two different
> partitions, after summing, you'd have to repartition the keys from each of
> the 2 partitions and write them to an intermediate topic, e.g., using
> ".through". After that, you'd then have to use aggregation again to
> compute the final sum.
>
> Thanks
> Eno
>
> > On 20 Jul 2016, at 11:54, Michael-Keith Bernard <mkbern...@opentable.com> wrote:
> >
> > Hello Kafka Users,
> >
> > I've been floating this question around the #apache-kafka IRC channel
> > on Freenode for the last week or two and I still haven't reached a
> > satisfying answer. The basic question is: How does Kafka Streams merge
> > partial results? So let me expand on that a bit...
> >
> > Consider the following word count example in the official Kafka Streams
> > repo (GitHub mirror):
> > https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/wordcount/WordCountDemo.java#L46
> >
> > Now suppose we run this topology against a Kafka topic with 2
> > partitions. Into the first partition, we insert the sentence "hello
> > world". Into the second partition, we insert the sentence "goodbye
> > world". We know a priori that the result of this computation is
> > something like:
> >
> > { hello: 1, goodbye: 1, world: 2 }  # made-up syntax for a compacted
> > log/KTable state
> >
> > And indeed we'd probably see precisely that result from a *single*
> > consumer process that sees *all* the data. However, my question is,
> > what if I have 1 consumer per topic partition (2 consumers total in the
> > above hypothetical)? Under that scenario, consumer 1 would emit
> > { hello: 1, world: 1 } and consumer 2 would emit
> > { goodbye: 1, world: 1 }... But the true answer requires an additional
> > reduction of duplicate keys (in this case with a sum operator, but that
> > needn't be the case for arbitrary aggregations).
> >
> > So again my question is, how are the partial results that each consumer
> > generates merged into a final result? A simple way to accomplish this
> > would be to produce an intermediate topic that is keyed by the word,
> > then aggregate that (since each consumer would see all the data for a
> > given key), but if that's happening it's not explicit anywhere in the
> > example. So what mechanism is Kafka Streams using internally to
> > aggregate the results of a partitioned stream across multiple
> > consumers? (Perhaps groupByKey creating an anonymous intermediate
> > topic?)
> >
> > I know that's a bit wordy, but I want to make sure my question is
> > extremely clear. If I've still fumbled on that, let me know and I will
> > try to be even more explicit. :)
> >
> > Cheers,
> > Michael-Keith Bernard
> >
> > P.S. Kafka is awesome and Kafka Streams looks even awesome-er!

--
-- Guozhang