RE: batch and partition - differences

Michal Singer Tue, 21 Jan 2014 02:45:07 -0800

Thanks for the answer.

I read the tutorials but I am still confused. Repartitioning is a little
confusing to me.


You wrote: “in the complete method of your Aggregator, only the tuples in
the current partition will have been aggregated”

However, in the second link you added, it is mentioned that: “aggregate is
run on each batch of the stream in isolation,”

So I am confused.

Doesn’t the complete method run on all batches, as mentioned in the link? Or
does it run on all the parts of batches that land in the same partition?



Thanks,





*From:* Laurent Thoulon [mailto:[email protected]]
*Sent:* Tuesday, January 21, 2014 12:13 PM
*To:* [email protected]
*Subject:* Re: batch and partition - differences



To understand this, if you haven't already read them, I'd advise you to
take a look at
https://github.com/nathanmarz/storm/wiki/Trident-tutorial
then
https://github.com/nathanmarz/storm/wiki/Trident-API-Overview

To summarize this and reuse Nathan's words: you will have, for instance,
100 tuples in one batch. Then partitioning kicks in and you may have one
partition with 30 tuples, another with 50 and a last one with 20, and these
partitions may be processed in parallel.

If you use .partitionBy, you define the "key fields" of your
partitioning. For instance, partitionBy(new Fields("id")) will ensure that
all the tuples from the batch that have the same "id" value are grouped in
the same partition.
Aggregation and groupBy use that mechanism, as it is necessary to have the
whole set of tuples that you want to aggregate together.
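
To make the key-based routing concrete, here is a minimal sketch in plain
Java. It does not use the Trident API: the class PartitionBySketch, its
methods and the sample ids are hypothetical, and the rule "hash of the key
modulo the number of partitions" is a simplified assumption about how the
target partition is picked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; this is NOT the Trident API.
class PartitionBySketch {
    // Assumed routing rule: hash of the key field modulo the partition count.
    static int partitionFor(String id, int numPartitions) {
        return Math.floorMod(id.hashCode(), numPartitions);
    }

    // Splits one batch (modelled as a list of "id" values) into partitions.
    static List<List<String>> partitionBy(List<String> batch, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String id : batch) {
            partitions.get(partitionFor(id, numPartitions)).add(id);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("a", "b", "a", "c", "b", "a");
        List<List<String>> partitions = partitionBy(batch, 3);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("partition " + i + ": " + partitions.get(i));
        }
    }
}
```

Whatever the number of partitions, tuples with the same id always land in
the same partition, which is the property grouping relies on.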

That being said, you should understand that in the complete method of your
Aggregator, only the tuples in the current partition will have been
aggregated. You can still use .global() or .batchGlobal() to bring all
the tuples of the batch into one partition before aggregating.
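
As a rough illustration of that difference, the following plain-Java sketch
counts the tuples of one batch per partition, using the 30/50/20 split from
the example, and then again after moving everything into a single partition.
It only mimics the init / aggregate / complete steps; the class
PerPartitionCountSketch and its helpers are hypothetical and are not the
Trident Aggregator interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; this is NOT the Trident API.
class PerPartitionCountSketch {
    // Runs a count "aggregator" once per partition of a single batch and
    // returns the value each complete step would emit.
    static List<Integer> countPerPartition(List<List<String>> partitions) {
        List<Integer> emitted = new ArrayList<>();
        for (List<String> partition : partitions) {
            int count = 0;                 // init: fresh state for this partition
            for (String tuple : partition) {
                count++;                   // aggregate: called once per tuple
            }
            emitted.add(count);            // complete: sees this partition only
        }
        return emitted;
    }

    // Models sending every tuple of the batch to one partition.
    static List<List<String>> toSinglePartition(List<List<String>> partitions) {
        List<String> all = new ArrayList<>();
        for (List<String> partition : partitions) {
            all.addAll(partition);
        }
        List<List<String>> single = new ArrayList<>();
        single.add(all);
        return single;
    }

    // Helper: a partition holding n placeholder tuples.
    static List<String> tuples(int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add("t" + i);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> batch = Arrays.asList(tuples(30), tuples(50), tuples(20));
        System.out.println(countPerPartition(batch));                    // [30, 50, 20]
        System.out.println(countPerPartition(toSinglePartition(batch))); // [100]
    }
}
```

So with three partitions you get three partial results for the batch, and
only after moving the whole batch into one partition do you get a single
result of 100.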


Laurent

------------------------------

*From: *"Michal Singer" <[email protected]>
*To: *[email protected]
*Sent: *Tuesday, January 21, 2014 08:37:59
*Subject: *RE: batch and partition - differences

So a batch can be divided into multiple partitions? And then, for example,
an aggregator will aggregate all the tuples in the batch in the complete
method?

thanks



*From:* [email protected] [mailto:[email protected]]
*On Behalf Of *Nathan Marz
*Sent:* Tuesday, January 21, 2014 7:06 AM
*To:* [email protected]
*Subject:* Re: batch and partition - differences



A batch is all the tuples being computed on at once in each run of the
topology. Each stage of the processing is split into partitions for
parallelization.



On Mon, Jan 20, 2014 at 3:31 AM, Michal Singer <[email protected]> wrote:

Hi, it is not clear to me what the difference is between a batch and a
partition in Trident.

Is a partition the task that the batch is processed on?

Can someone explain the difference?

thanks





-- 

Twitter: @nathanmarz

http://nathanmarz.com
