Thanks for the answer. I read the tutorials but I am still confused, in particular about repartitioning.
You wrote: “in the complete method of your Aggregator, only the tuples in the current partition will have been aggregated”

However, the second link you added says: “aggregate is run on each batch of the stream in isolation,”

So I am confused. Doesn’t the complete method run on all batches, as mentioned in the link? Or does it run on all the parts of batches that are on the same partition?

Thanks,

*From:* Laurent Thoulon [mailto:[email protected]]
*Sent:* Tuesday, January 21, 2014 12:13 PM
*To:* [email protected]
*Subject:* Re: batch and partition - differences

To understand this, if you haven't already read them, I'd advise you to take a look at https://github.com/nathanmarz/storm/wiki/Trident-tutorial and then https://github.com/nathanmarz/storm/wiki/Trident-API-Overview

To summarize this and reuse Nathan's words: you will have, for instance, 100 tuples in one batch. Then partitioning kicks in, and you may have one partition with 30 tuples, another with 50 and a last one with 20, which may be processed in parallel.

If you use .partitionBy, you define the "key fields" of your partitioning. For instance, partitionBy(new Fields("id")) ensures that all the tuples from the batch that have a matching "id" field are grouped in the same partition. Aggregation and groupBy use that mechanism, as it is necessary to have the whole set of tuples that you want to aggregate together.

That being said, you should understand that in the complete method of your Aggregator, only the tuples in the current partition will have been aggregated. You can still use .global() or .batchGlobal() to aggregate all the tuples of the batch into one partition.

Laurent

------------------------------

*From:* "Michal Singer" <[email protected]>
*To:* [email protected]
*Sent:* Tuesday, January 21, 2014 08:37:59
*Subject:* RE: batch and partition - differences

So a batch can be divided into multiple partitions?
And then, for example, an aggregator will aggregate all the tuples in the batch in the complete method?

Thanks

*From:* [email protected] [mailto:[email protected]] *On Behalf Of* Nathan Marz
*Sent:* Tuesday, January 21, 2014 7:06 AM
*To:* [email protected]
*Subject:* Re: batch and partition - differences

A batch is all the tuples being computed on at once in each run of the topology. Each stage of the processing is split into partitions for parallelization.

On Mon, Jan 20, 2014 at 3:31 AM, Michal Singer <[email protected]> wrote:

Hi,

It is not clear what the difference is between a batch and a partition in Trident. Is a partition the task that the batch is performed on? Can someone explain the difference?

Thanks

--
Twitter: @nathanmarz
http://nathanmarz.com
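
As an illustration of the batch/partition distinction discussed in this thread, here is a minimal plain-Java sketch. It does not use the Trident API: the class and method names are hypothetical, each tuple is reduced to its "id" value, and routing by hash modulo the partition count is an assumption made for the example, not necessarily how Trident assigns partitions. It only shows why a per-partition aggregation yields one result per partition of a batch, while sending the whole batch to a single partition first (the role Laurent describes for .batchGlobal()) yields one result for the batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of one batch being split into partitions.
// Hypothetical names throughout; nothing here calls Storm or Trident.
class BatchPartitionSketch {

    // Splits one batch into numPartitions partitions by hashing the "id" value,
    // so tuples with the same id always land in the same partition.
    public static List<List<String>> partitionById(List<String> batchIds, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String id : batchIds) {
            int target = Math.floorMod(id.hashCode(), numPartitions);
            partitions.get(target).add(id);
        }
        return partitions;
    }

    // Puts every tuple of the batch into a single partition.
    public static List<List<String>> singlePartition(List<String> batchIds) {
        List<List<String>> partitions = new ArrayList<>();
        partitions.add(new ArrayList<>(batchIds));
        return partitions;
    }

    // Stand-in for a counting aggregator: it sees each partition separately
    // and produces one count per partition.
    public static List<Integer> countPerPartition(List<List<String>> partitions) {
        List<Integer> counts = new ArrayList<>();
        for (List<String> partition : partitions) {
            counts.add(partition.size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // One batch of six tuples, identified only by their "id" field.
        List<String> batch = Arrays.asList("a", "b", "a", "c", "b", "a");

        // Three partitions: one count per partition, tuples grouped by id.
        List<List<String>> byId = partitionById(batch, 3);
        System.out.println("partitions: " + byId);
        System.out.println("per-partition counts: " + countPerPartition(byId));

        // One partition: a single count covering the whole batch.
        System.out.println("single-partition count: " + countPerPartition(singlePartition(batch)));
    }
}
```

In this sketch the per-partition counts always add up to the batch size, and the single-partition case returns exactly one count equal to it.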
