I'd advise you to take a look at this part of the wiki, which should answer your question: https://github.com/nathanmarz/storm/wiki/Trident-API-Overview#operations-on-grouped-streams
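As a rough illustration of the grouped-stream behaviour described on that page (groupBy repartitions on the group fields, then groups tuples with equal fields inside each partition), here is a minimal plain-Java sketch. It does not use Storm at all: the class name, the helper methods, and the hash-modulo partitioning are assumptions made up for the example, not Trident's actual implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model (no Storm dependency) of groups living inside partitions. */
class GroupsInPartitions {

    /** Hypothetical stand-in for partitioning on a key field: same key, same partition. */
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Returns partition index -> (group key -> tuple count) for one batch of keys. */
    static Map<Integer, Map<String, Integer>> countPerGroup(List<String> batchKeys, int numPartitions) {
        Map<Integer, Map<String, Integer>> result = new TreeMap<>();
        for (String key : batchKeys) {
            int p = partitionOf(key, numPartitions);
            // Within the partition, tuples with the same key form one group.
            result.computeIfAbsent(p, k -> new TreeMap<>()).merge(key, 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("a", "b", "c", "a", "c", "a");
        // With 2 partitions, "a" and "c" share a partition but stay separate groups.
        System.out.println(countPerGroup(batch, 2)); // prints {0={b=1}, 1={a=3, c=2}}
    }
}
```

So in this model a partition can hold several groups, but a given group never spans two partitions.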
Laurent

----- Original message -----
From: "churly lin" <[email protected]>
To: "user" <[email protected]>
Sent: Wednesday, 22 January 2014 11:15:38
Subject: Re: batch and partition - differences

+1. I have the same question after reading Laurent's nice answer.

2014/1/22 Susheel Kumar Gadalay < [email protected] >

Here, what is the difference between partition and group by? Can a partition contain multiple groups?

On 1/22/14, Laurent Thoulon < [email protected] > wrote:
> Hi Michal
>
> Indeed, I'll quote the wiki:
> Running aggregate on a Stream does a global aggregation.
> If you run aggregators on a grouped stream, the aggregation will be run
> within each group instead of against the whole batch.
> partitionAggregate runs a function on each partition of a batch of tuples.
>
> So, in the complete method of your Aggregator, depending on how you
> set up your stream, I understand you will:
> - Aggregate all the tuples of the batch if you just used .aggregate on a
> stream
> - Aggregate the tuples by group if you used .groupBy().aggregate()
> - Aggregate the tuples of the partition defined by Storm if you used
> partitionAggregate()
> - Aggregate the tuples of the partition you defined if you used
> .partitionBy().partitionAggregate()
>
> Laurent
>
> ----- Original message -----
> From: "Michal Singer" < [email protected] >
> To: [email protected]
> Sent: Tuesday, 21 January 2014 11:47:05
> Subject: RE: batch and partition - differences
>
> Actually, isn’t the difference between aggregate and partitionAggregate that
> aggregate runs on the whole batch and partitionAggregate on each partition of
> the batch?
> thanks
>
> From: Michal Singer [mailto: [email protected] ]
> Sent: Tuesday, January 21, 2014 12:44 PM
> To: ' [email protected] '
> Subject: RE: batch and partition - differences
>
> Thanks for the answer.
> I read the tutorials but I was still confused. The repartitioning is a little
> confusing for me.
> You wrote: “in the complete method of your Aggregator, only the tuples in
> the current partition will have been aggregated”
> However, in the second link you added, it is mentioned that: “aggregate is
> run on each batch of the stream in isolation,”
> So I am confused.
> Doesn’t the complete method run, as mentioned in the link, on all batches? Or
> does it run on all parts of batches that run on the same partition?
>
> Thanks,
>
> From: Laurent Thoulon [ mailto: [email protected] ]
> Sent: Tuesday, January 21, 2014 12:13 PM
> To: [email protected]
> Subject: Re: batch and partition - differences
>
> To understand this, if you haven't already read them, I'd advise you to
> take a look at
> https://github.com/nathanmarz/storm/wiki/Trident-tutorial
> then
> https://github.com/nathanmarz/storm/wiki/Trident-API-Overview
>
> To summarize this and reuse Nathan's words, you will have, for instance,
> 100 tuples in one batch. Then partitioning kicks in and you may have one
> partition with 30 tuples, another with 50 and a last one with 20, which may
> be processed in parallel.
>
> If you use .partitionBy, you define the "key fields" of your
> partitioning. For instance, partitionBy(new Fields("id")) will ensure that all
> the tuples from the batch that have a matching "id" field are placed in the
> same partition.
> Aggregation and groupBy use that mechanism, as it's necessary to have the
> whole set of tuples that you want to aggregate together.
>
> That being said, you should understand that in the complete method of your
> Aggregator, only the tuples in the current partition will have been
> aggregated. You can still use .global() or .batchGlobal() to bring all
> the tuples of the batch into one partition.
>
> Laurent
>
> From: "Michal Singer" < [email protected] >
> To: [email protected]
> Sent: Tuesday, 21 January 2014 08:37:59
> Subject: RE: batch and partition - differences
> So a batch can be divided into multiple partitions?
> And then, for example, an aggregator will aggregate all the tuples in the
> batch in the complete method?
> thanks
>
> From: [email protected] [mailto: [email protected] ] On Behalf Of
> Nathan Marz
> Sent: Tuesday, January 21, 2014 7:06 AM
> To: [email protected]
> Subject: Re: batch and partition - differences
>
> A batch is all the tuples being computed on at once each run of the
> topology. Each stage of the processing is split into partitions for
> parallelization.
>
> On Mon, Jan 20, 2014 at 3:31 AM, Michal Singer < [email protected] >
> wrote:
>
> Hi, it is not clear what the difference is between a batch and a partition in
> Trident.
> Is the partition the task that the batch is performed on?
> Can someone explain the difference?
> thanks
>
> --
> Twitter: @nathanmarz
> http://nathanmarz.com
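To make the batch/partition distinction from this thread concrete, here is a small plain-Java sketch. It is only a model, not Storm code: the class and method names are hypothetical, and the round-robin split is an arbitrary assumption (so the sizes differ from the 30/50/20 example above). It shows one batch of 100 tuples split into 3 partitions, with a per-partition count standing in for partitionAggregate and a whole-batch count standing in for aggregate.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model (no Storm dependency) of one batch split into partitions. */
class BatchPartitionSketch {

    /** Splits one batch into numPartitions lists, round-robin (an assumed, arbitrary strategy). */
    static List<List<Integer>> partition(List<Integer> batch, int numPartitions) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < batch.size(); i++) {
            partitions.get(i % numPartitions).add(batch.get(i));
        }
        return partitions;
    }

    /** Analogue of partitionAggregate with a counting aggregator: one count per partition. */
    static List<Integer> countPerPartition(List<List<Integer>> partitions) {
        List<Integer> counts = new ArrayList<>();
        for (List<Integer> p : partitions) {
            counts.add(p.size());
        }
        return counts;
    }

    /** Analogue of aggregate with a counting aggregator: one count for the whole batch. */
    static int countWholeBatch(List<List<Integer>> partitions) {
        int total = 0;
        for (List<Integer> p : partitions) {
            total += p.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> batch = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            batch.add(i);
        }
        List<List<Integer>> partitions = partition(batch, 3);
        System.out.println(countPerPartition(partitions)); // prints [34, 33, 33]
        System.out.println(countWholeBatch(partitions));   // prints 100
    }
}
```

Each new batch would go through the same steps independently, which matches the wiki's remark that aggregate runs on each batch in isolation.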
