+1.
I have the same question after reading Laurent's nice answer.

2014/1/22 Susheel Kumar Gadalay <[email protected]>

> Here, what is the difference between a partition and a group-by?
> Can a partition contain multiple groups?
>
> On 1/22/14, Laurent Thoulon <[email protected]> wrote:
> > Hi Michal
> >
> > Indeed, I'll quote the wiki:
> > Running aggregate on a Stream does a global aggregation.
> > If you run aggregators on a grouped stream, the aggregation will be run
> > within each group instead of against the whole batch.
> > partitionAggregate runs a function on each partition of a batch of
> > tuples.
> >
> > So, in the complete method of your Aggregator, depending on how you
> > set up your stream, I understand you will:
> > - Aggregate all the tuples of the batch if you just used .aggregate on a
> > stream
> > - Aggregate the tuples by group if you used .groupBy().aggregate()
> > - Aggregate the tuples of the partition defined by Storm if you used
> > partitionAggregate()
> > - Aggregate the tuples of the partition you defined if you used
> > .partitionBy().partitionAggregate()
> >
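The four cases above differ only in which tuples end up in one aggregation. A rough plain-Java sketch of that idea follows, with no Storm dependency. The class and method names are made up for illustration and are not part of the Trident API, and the batch is assumed to be already split into two partitions of (key, count) tuples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AggregationScopeSketch {

    // Scope of .aggregate(...) on a plain stream: every tuple in the batch.
    static int sumWholeBatch(List<List<Map.Entry<String, Integer>>> partitions) {
        int sum = 0;
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            for (Map.Entry<String, Integer> tuple : partition) {
                sum += tuple.getValue();
            }
        }
        return sum;
    }

    // Scope of .groupBy(...).aggregate(...): one result per key within the batch.
    static Map<String, Integer> sumPerGroup(List<List<Map.Entry<String, Integer>>> partitions) {
        Map<String, Integer> sums = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            for (Map.Entry<String, Integer> tuple : partition) {
                sums.merge(tuple.getKey(), tuple.getValue(), Integer::sum);
            }
        }
        return sums;
    }

    // Scope of .partitionAggregate(...): one result per partition of the batch.
    static List<Integer> sumPerPartition(List<List<Map.Entry<String, Integer>>> partitions) {
        List<Integer> sums = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            int sum = 0;
            for (Map.Entry<String, Integer> tuple : partition) {
                sum += tuple.getValue();
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // One batch of three tuples, split into two partitions.
        List<List<Map.Entry<String, Integer>>> batch = List.of(
                List.of(Map.entry("a", 1), Map.entry("b", 2)),
                List.of(Map.entry("a", 3)));
        System.out.println(sumWholeBatch(batch));   // 6
        System.out.println(sumPerGroup(batch));     // {a=4, b=2}
        System.out.println(sumPerPartition(batch)); // [3, 3]
    }
}
```

Note that the key "a" sits in both partitions here, so the per-partition sums say nothing about the per-key totals. That is the situation .partitionBy() or .groupBy() is meant to avoid.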
> >
> > Laurent
> > ----- Original Message -----
> >
> > From: "Michal Singer" <[email protected]>
> > To: [email protected]
> > Sent: Tuesday, January 21, 2014 11:47:05
> > Subject: RE: batch and partition - differences
> >
> >
> >
> > Actually, isn’t the difference between aggregate and partitionAggregate
> > that aggregate runs on the whole batch and partitionAggregate on each
> > partition of the batch?
> > Thanks
> >
> >
> >
> > From: Michal Singer [mailto: [email protected] ]
> > Sent: Tuesday, January 21, 2014 12:44 PM
> > To: ' [email protected] '
> > Subject: RE: batch and partition - differences
> >
> > Thanks for the answer.
> > I read the tutorials but I was still confused. Repartitioning is a
> > little confusing to me.
> > You wrote: “in the complete method of your Aggregator, only the tuples
> > in the current partition will have been aggregated”
> > However, the second link you added mentions that “aggregate is run on
> > each batch of the stream in isolation,”
> > So I am confused.
> > Doesn’t the complete method run on all batches, as mentioned in the
> > link? Or does it run on all parts of batches that end up on the same
> > partition?
> >
> > Thanks,
> >
> >
> >
> >
> > From: Laurent Thoulon [ mailto:[email protected] ]
> > Sent: Tuesday, January 21, 2014 12:13 PM
> > To: [email protected]
> > Subject: Re: batch and partition - differences
> >
> >
> > To understand this, if you haven't already read them, I'd advise you to
> > take a look at
> > https://github.com/nathanmarz/storm/wiki/Trident-tutorial
> > then
> > https://github.com/nathanmarz/storm/wiki/Trident-API-Overview
> >
> > To summarize this and reuse Nathan's words, you will have, for
> > instance, 100 tuples in one batch. Then partitioning kicks in and you
> > may have one partition with 30 tuples, another with 50 and a last one
> > with 20, which may be processed in parallel.
> >
> > If you use .partitionBy, you define the "key fields" of your
> > partitioning. For instance, partitionBy(new Fields("id")) will ensure
> > that all the tuples from the batch that have a matching "id" field are
> > grouped in the same partition.
> > Aggregation and groupBy use that mechanism, as it's necessary to have
> > the whole set of tuples that you want to aggregate together.
> >
> > That being said, you should understand that in the complete method of
> > your Aggregator, only the tuples in the current partition will have been
> > aggregated. You can still use .global() or .batchGlobal() to aggregate
> > all the tuples of the batch into one partition.
> >
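The partitioning described above can be pictured with a small plain-Java sketch, again without Storm itself. It assumes a simple hash-modulo scheme on the "id" field, which may differ in detail from what Storm actually does, and the names partitionById and global are hypothetical stand-ins, not Trident methods. Tuples are reduced to their id for brevity.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionSketch {

    // Hypothetical stand-in for partitionBy(new Fields("id")): the partition
    // index depends only on the id, so equal ids always land together.
    static List<List<String>> partitionById(List<String> ids, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String id : ids) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(id.hashCode(), numPartitions);
            partitions.get(index).add(id);
        }
        return partitions;
    }

    // Hypothetical stand-in for .global(): all tuples of the batch are moved
    // into a single partition.
    static List<List<String>> global(List<List<String>> partitions) {
        List<String> all = new ArrayList<>();
        for (List<String> partition : partitions) {
            all.addAll(partition);
        }
        List<List<String>> result = new ArrayList<>();
        result.add(all);
        return result;
    }

    public static void main(String[] args) {
        // One batch of six tuples, identified only by their "id" field.
        List<String> batch = List.of("a", "b", "a", "c", "b", "a");
        List<List<String>> partitions = partitionById(batch, 3);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("partition " + i + ": " + partitions.get(i));
        }
        System.out.println("after global: " + global(partitions));
    }
}
```

An Aggregator running after partitionById would see each inner list separately, while after global it would see the whole batch at once.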
> >
> > Laurent
> >
> >
> >
> > From: "Michal Singer" < [email protected] >
> > To: [email protected]
> > Sent: Tuesday, January 21, 2014 08:37:59
> > Subject: RE: batch and partition - differences
> >
> > So a batch can be divided into multiple partitions? And then, for
> > example, an aggregator will aggregate all the tuples in the batch in the
> > complete method?
> > Thanks
> >
> > From: [email protected] [mailto: [email protected] ] On Behalf Of Nathan Marz
> > Sent: Tuesday, January 21, 2014 7:06 AM
> > To: [email protected]
> > Subject: Re: batch and partition - differences
> >
> >
> > A batch is all the tuples being processed together in each run of the
> > topology. Each stage of the processing is split into partitions for
> > parallelization.
> >
> >
> >
> > On Mon, Jan 20, 2014 at 3:31 AM, Michal Singer < [email protected] >
> > wrote:
> >
> >
> > Hi, it is not clear what the difference is between a batch and a
> > partition in Trident.
> > Is a partition the task that the batch is processed on?
> > Can someone explain the difference?
> > Thanks
> >
> >
> >
> >
> >
> > --
> >
> >
> > Twitter: @nathanmarz
> > http://nathanmarz.com
> >
> >
>
