Here, what is the difference between a partition and a group (as in groupBy)?
Can a partition contain multiple groups?

On 1/22/14, Laurent Thoulon <[email protected]> wrote:
> Hi Michal
>
> Indeed, I'll quote the wiki:
> Running aggregate on a Stream does a global aggregation.
> If you run aggregators on a grouped stream, the aggregation will be run
> within each group instead of against the whole batch.
> partitionAggregate runs a function on each partition of a batch of tuples.
>
> So, in the complete method of your Aggregator, depending on how you
> set up your stream, I understand you will:
> - Aggregate all the tuples of the batch if you just used .aggregate() on a
> stream
> - Aggregate the tuples by group if you used .groupBy().aggregate()
> - Aggregate the tuples of the partition defined by Storm if you used
> partitionAggregate()
> - Aggregate the tuples of the partition you defined if you used
> .partitionBy().partitionAggregate()
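
The scopes in the list above can be sketched in plain Java. This is only an illustration under stated assumptions, not Trident code: the class AggregationScopes, the Tuple type with its id and value fields, and the sample data are all hypothetical, and sum() merely stands in for whatever an Aggregator's complete method would emit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: a batch is a list of partitions, a partition is a list of tuples.
class AggregationScopes {
    static final class Tuple {
        final String id;
        final int value;
        Tuple(String id, int value) { this.id = id; this.value = value; }
    }

    // Stand-in for an aggregator: sums whatever tuples it is handed.
    static int sum(List<Tuple> tuples) {
        int total = 0;
        for (Tuple t : tuples) total += t.value;
        return total;
    }

    // Scope of .aggregate(): one result for the whole batch.
    static int aggregateBatch(List<List<Tuple>> partitions) {
        int total = 0;
        for (List<Tuple> p : partitions) total += sum(p);
        return total;
    }

    // Scope of .partitionAggregate(): one result per partition.
    static List<Integer> aggregatePerPartition(List<List<Tuple>> partitions) {
        List<Integer> out = new ArrayList<>();
        for (List<Tuple> p : partitions) out.add(sum(p));
        return out;
    }

    // Scope of .groupBy(id).aggregate(): one result per distinct id in the batch.
    static Map<String, Integer> aggregatePerGroup(List<List<Tuple>> partitions) {
        Map<String, Integer> out = new TreeMap<>();
        for (List<Tuple> p : partitions)
            for (Tuple t : p) out.merge(t.id, t.value, Integer::sum);
        return out;
    }

    // One batch of five tuples, already split into three partitions.
    static List<List<Tuple>> sampleBatch() {
        List<List<Tuple>> batch = new ArrayList<>();
        batch.add(Arrays.asList(new Tuple("a", 1), new Tuple("b", 2)));
        batch.add(Arrays.asList(new Tuple("a", 3)));
        batch.add(Arrays.asList(new Tuple("b", 4), new Tuple("c", 5)));
        return batch;
    }

    public static void main(String[] args) {
        List<List<Tuple>> batch = sampleBatch();
        System.out.println("whole batch:   " + aggregateBatch(batch));
        System.out.println("per partition: " + aggregatePerPartition(batch));
        System.out.println("per group:     " + aggregatePerGroup(batch));
    }
}
```

The same five tuples give one total for the batch, three totals for the partitions, and three totals for the ids; only the scope handed to the aggregator changes.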
>
>
> Laurent
> ----- Original message -----
>
> From: "Michal Singer" <[email protected]>
> To: [email protected]
> Sent: Tuesday, January 21, 2014 11:47:05
> Subject: RE: batch and partition - differences
>
>
>
> Actually, isn’t the difference between aggregate and partitionAggregate that
> aggregate runs on the whole batch and partitionAggregate runs on each
> partition of the batch?
> thanks
>
>
>
> From: Michal Singer [mailto: [email protected] ]
> Sent: Tuesday, January 21, 2014 12:44 PM
> To: ' [email protected] '
> Subject: RE: batch and partition - differences
>
> Thanks for the answer.
> I read the tutorials but I am still confused. Repartitioning is a little
> confusing for me.
> You wrote: “in the complete method of your Aggregator, only the tuples in
> the current partition will have been aggregated”
> However, the second link you added says: “aggregate is
> run on each batch of the stream in isolation,”
> So I am confused.
> Doesn’t the complete method run, as the link says, on the whole batch? Or
> does it run only on the part of the batch that is in the same partition?
>
> Thanks,
>
>
>
>
> From: Laurent Thoulon [ mailto:[email protected] ]
> Sent: Tuesday, January 21, 2014 12:13 PM
> To: [email protected]
> Subject: Re: batch and partition - differences
>
>
> To understand this, if you haven't already read them, I'd advise you to
> take a look at
> https://github.com/nathanmarz/storm/wiki/Trident-tutorial
> then
> https://github.com/nathanmarz/storm/wiki/Trident-API-Overview
>
> To summarize this and reuse Nathan's words, you will have, for instance,
> 100 tuples in one batch. Then partitioning kicks in and you may have one
> partition with 30 tuples, another with 50 and a last one with 20, which may
> be processed in parallel.
>
> If you use .partitionBy, you define the "key fields" of your
> partitioning. For instance, partitionBy(new Fields("id")) ensures that all
> the tuples from the batch that have the same "id" value end up in the
> same partition.
> Aggregation and groupBy use that mechanism, as it's necessary to have the
> whole set of tuples that you want to aggregate together in one place.
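
To illustrate the key-field idea, here is a minimal sketch in plain Java. It assumes the routing is done by hashing the key modulo the number of partitions, which is an assumption for illustration and not Storm's actual implementation; PartitionBySketch, partitionFor and partitionThenGroup are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of key-based partitioning followed by grouping inside each partition.
class PartitionBySketch {
    // Equal ids always hash to the same partition index (assumed routing rule).
    static int partitionFor(String id, int numPartitions) {
        return Math.floorMod(id.hashCode(), numPartitions);
    }

    // Returns, for each partition, a map from id to the values routed there.
    static List<Map<String, List<Integer>>> partitionThenGroup(
            String[] ids, int[] values, int numPartitions) {
        List<Map<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new TreeMap<String, List<Integer>>());
        }
        for (int i = 0; i < ids.length; i++) {
            partitions.get(partitionFor(ids[i], numPartitions))
                      .computeIfAbsent(ids[i], k -> new ArrayList<>())
                      .add(values[i]);
        }
        return partitions;
    }

    public static void main(String[] args) {
        String[] ids = {"a", "b", "a", "c", "b"};
        int[] values = {1, 2, 3, 4, 5};
        // Each printed map is one partition; each key inside it is one group.
        System.out.println(partitionThenGroup(ids, values, 2));
    }
}
```

With two partitions in this sketch, the ids "a" and "c" share one partition while "b" sits alone in the other: all tuples of a group are in the same partition, but a single partition can hold several groups.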
>
> That being said, you should understand that in the complete method of your
> Aggregator, only the tuples in the current partition will have been
> aggregated. You can still use .global() or .batchGlobal() to send all
> the tuples of the batch to one partition and aggregate them there.
>
>
> Laurent
>
>
>
> From: "Michal Singer" < [email protected] >
> To: [email protected]
> Sent: Tuesday, January 21, 2014 08:37:59
> Subject: RE: batch and partition - differences
> So a batch can be divided into multiple partitions? And then, for example, an
> aggregator will aggregate all the tuples in the batch in the complete
> method?
> thanks
>
> From: [email protected] [mailto: [email protected] ] On Behalf Of
> Nathan Marz
> Sent: Tuesday, January 21, 2014 7:06 AM
> To: [email protected]
> Subject: Re: batch and partition - differences
>
>
> A batch is all the tuples being processed at once in each run of the
> topology. Each stage of the processing is split into partitions for
> parallelization.
>
>
>
> On Mon, Jan 20, 2014 at 3:31 AM, Michal Singer < [email protected] >
> wrote:
>
>
> Hi, it is not clear what the difference is between a batch and a partition in
> Trident.
> Is a partition the task that the batch is processed on?
> Can someone explain the difference?
> thanks
>
>
>
>
>
> --
>
>
> Twitter: @nathanmarz
> http://nathanmarz.com
>
>
