+1. I have the same question after reading Laurent's nice answer.
2014/1/22 Susheel Kumar Gadalay <[email protected]>

> Here, what is the difference between partition and group by?
> Can a partition have multiple groups?
>
> On 1/22/14, Laurent Thoulon <[email protected]> wrote:
> > Hi Michal
> >
> > Indeed, I'll quote the wiki:
> > Running aggregate on a Stream does a global aggregation.
> > If you run aggregators on a grouped stream, the aggregation will be run
> > within each group instead of against the whole batch.
> > partitionAggregate runs a function on each partition of a batch of
> > tuples.
> >
> > So, in the complete method of your Aggregator, depending on how you
> > managed your stream, I understand you will:
> > - Aggregate all the tuples of the batch if you just used .aggregate on a
> >   stream
> > - Aggregate the tuples by group if you used .groupBy().aggregate()
> > - Aggregate the tuples of the partition defined by Storm if you used
> >   partitionAggregate()
> > - Aggregate the tuples of the partition you defined if you used
> >   .partitionBy().partitionAggregate()
> >
> > Laurent
> >
> > ----- Original message -----
> >
> > From: "Michal Singer" <[email protected]>
> > To: [email protected]
> > Sent: Tuesday, 21 January 2014 11:47:05
> > Subject: RE: batch and partition - differences
> >
> > Actually, isn’t the difference between aggregate and partitionAggregate that
> > aggregate runs on the whole batch and partitionAggregate on each partition of
> > the batch?
> > thanks
> >
> > From: Michal Singer [mailto: [email protected] ]
> > Sent: Tuesday, January 21, 2014 12:44 PM
> > To: ' [email protected] '
> > Subject: RE: batch and partition - differences
> >
> > Thanks for the answer.
> > I read the tutorials but I was still confused. The repartition is a little
> > confusing for me.
> > You wrote: “in the complete method of your Aggregator, only the tuples in
> > the current partition will have been aggregated”
> > However, in the second link you added, it is mentioned that: “aggregate is
> > run on each batch of the stream in isolation,”
> > So I am confused.
> > Doesn’t the complete method run, as mentioned in the link, on all batches? Or
> > does it run on all parts of batches that run on the same partition?
> >
> > Thanks,
> >
> > From: Laurent Thoulon [ mailto:[email protected] ]
> > Sent: Tuesday, January 21, 2014 12:13 PM
> > To: [email protected]
> > Subject: Re: batch and partition - differences
> >
> > To understand this, if you haven't already read those, I'd advise you to
> > take a look at
> > https://github.com/nathanmarz/storm/wiki/Trident-tutorial
> > then
> > https://github.com/nathanmarz/storm/wiki/Trident-API-Overview
> >
> > To summarize this and reuse the words of Nathan, you will have for instance
> > 100 tuples in one batch. Then partitioning kicks in and you may have one
> > partition with 30 tuples, another with 50 and a last one with 20 that may
> > be parallelized.
> >
> > If you use .partitionBy, you will define the "key fields" of your
> > partitioning. For instance partitionBy(new Fields("id")) will ensure that all
> > the tuples from the batch that have a matching "id" field are grouped in the
> > same partition.
> > Aggregation and groupBy use that mechanism, as it's necessary to have the
> > whole set of tuples that you want to aggregate together.
> >
> > That being said, you should understand that in the complete method of your
> > Aggregator, only the tuples in the current partition will have been
> > aggregated. You can still use .global() or .batchGlobal() to aggregate all
> > the tuples of the batch into one partition.
> >
> > Laurent
> >
> > From: "Michal Singer" < [email protected] >
> > To: [email protected]
> > Sent: Tuesday, 21 January 2014 08:37:59
> > Subject: RE: batch and partition - differences
> >
> > So a batch can be divided into multiple partitions? And then, for example, an
> > aggregator will aggregate all the tuples in the batch in the complete
> > method?
> > thanks
> >
> > From: [email protected] [mailto: [email protected] ] On Behalf Of
> > Nathan Marz
> > Sent: Tuesday, January 21, 2014 7:06 AM
> > To: [email protected]
> > Subject: Re: batch and partition - differences
> >
> > A batch is all the tuples being computed on at once each run of the
> > topology. Each stage of the processing is split into partitions for
> > parallelization.
> >
> > On Mon, Jan 20, 2014 at 3:31 AM, Michal Singer < [email protected] >
> > wrote:
> >
> > Hi, it is not clear what the difference is between batch and partition in
> > Trident.
> > Is partition the task that the batch is performed on?
> > Can someone explain the difference?
> > thanks
> >
> > --
> > Twitter: @nathanmarz
> > http://nathanmarz.com
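
To make the scopes discussed in this thread concrete, here is a rough plain-Java sketch of a single batch. It does not use any Storm classes: the class BatchScopes, the tuple type T and every method name are hypothetical and only mimic the Trident operation they are named after. The hash-modulo routing in partitionOf is an assumption for illustration, not a statement about how Storm assigns partitions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of one Trident batch. No Storm classes are used;
// every name here is hypothetical and only mimics the operation it is named after.
class BatchScopes {
    // Minimal tuple: an "id" key field and a numeric "value" field.
    static final class T {
        final String id;
        final int value;
        T(String id, int value) { this.id = id; this.value = value; }
    }

    // Assumption: route by hash of the key modulo the partition count.
    static int partitionOf(String id, int numPartitions) {
        return Math.floorMod(id.hashCode(), numPartitions);
    }

    // Mimics partitionBy(new Fields("id")): tuples with equal id share a partition.
    static Map<Integer, List<T>> partitionBy(List<T> batch, int numPartitions) {
        Map<Integer, List<T>> partitions = new TreeMap<>();
        for (T t : batch) {
            partitions.computeIfAbsent(partitionOf(t.id, numPartitions),
                                       k -> new ArrayList<>()).add(t);
        }
        return partitions;
    }

    // Mimics aggregate(): a single sum over the whole batch.
    static int aggregate(List<T> batch) {
        int sum = 0;
        for (T t : batch) sum += t.value;
        return sum;
    }

    // Mimics partitionAggregate(): one sum per partition of the batch.
    static Map<Integer, Integer> partitionAggregate(Map<Integer, List<T>> partitions) {
        Map<Integer, Integer> sums = new TreeMap<>();
        for (Map.Entry<Integer, List<T>> e : partitions.entrySet()) {
            sums.put(e.getKey(), aggregate(e.getValue()));
        }
        return sums;
    }

    // Mimics groupBy(new Fields("id")).aggregate(): one sum per distinct id.
    static Map<String, Integer> groupByAggregate(List<T> batch) {
        Map<String, Integer> sums = new TreeMap<>();
        for (T t : batch) sums.merge(t.id, t.value, Integer::sum);
        return sums;
    }

    // A small made-up batch of five tuples with three distinct ids.
    static List<T> sampleBatch() {
        List<T> batch = new ArrayList<>();
        batch.add(new T("a", 1));
        batch.add(new T("b", 2));
        batch.add(new T("a", 3));
        batch.add(new T("c", 4));
        batch.add(new T("b", 5));
        return batch;
    }

    public static void main(String[] args) {
        List<T> batch = sampleBatch();
        System.out.println("aggregate: " + aggregate(batch));
        System.out.println("partitionAggregate: "
                + partitionAggregate(partitionBy(batch, 2)));
        System.out.println("groupBy.aggregate: " + groupByAggregate(batch));
    }
}
```

In this model, with two partitions, the ids "a" and "c" land in the same partition, so that partition's sum covers two groups, while the grouped aggregation still gives one result per id. That is one way to read the partition versus group by question above: a partition is a unit of parallelism that may hold several groups, and a group is the set of tuples sharing the same key.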
