Thank you again Laurent ^ ^
2014-01-22 Laurent Thoulon <[email protected]>

> I'd advise you to take a look at this part of the wiki, which should
> answer your question:
>
> https://github.com/nathanmarz/storm/wiki/Trident-API-Overview#operations-on-grouped-streams
>
> Laurent
> ------------------------------
> *From: *"churly lin" <[email protected]>
> *To: *"user" <[email protected]>
> *Sent: *Wednesday, 22 January 2014 11:15:38
> *Subject: *Re: batch and partition - differences
>
> +1.
> I have the same question after reading Laurent's nice answer.
>
> 2014/1/22 Susheel Kumar Gadalay <[email protected]>
>
>> Here, what is the difference between partition and group by?
>> Can a partition have multiple groups?
>>
>> On 1/22/14, Laurent Thoulon <[email protected]> wrote:
>> > Hi Michal
>> >
>> > Indeed, I'll quote the wiki:
>> > Running aggregate on a Stream does a global aggregation.
>> > If you run aggregators on a grouped stream, the aggregation will be run
>> > within each group instead of against the whole batch.
>> > partitionAggregate runs a function on each partition of a batch of tuples.
>> >
>> > So, in the complete method of your Aggregator, depending on how you
>> > managed your stream, I understand you will:
>> > - Aggregate all the tuples of the batch if you just used .aggregate() on a
>> >   stream
>> > - Aggregate the tuples by group if you used .groupBy().aggregate()
>> > - Aggregate the tuples of the partition defined by Storm if you used
>> >   partitionAggregate()
>> > - Aggregate the tuples of the partition you defined if you used
>> >   .partitionBy().partitionAggregate()
>> >
>> > Laurent
>> >
>> > ----- Original message -----
>> >
>> > From: "Michal Singer" <[email protected]>
>> > To: [email protected]
>> > Sent: Tuesday, 21 January 2014 11:47:05
>> > Subject: RE: batch and partition - differences
>> >
>> > Actually, isn't the difference between aggregate and partitionAggregate
>> > that aggregate runs on the whole batch and partitionAggregate on each
>> > partition of the batch?
>> > Thanks
>> >
>> > From: Michal Singer [mailto:[email protected]]
>> > Sent: Tuesday, January 21, 2014 12:44 PM
>> > To: '[email protected]'
>> > Subject: RE: batch and partition - differences
>> >
>> > Thanks for the answer.
>> > I read the tutorials but I was still confused. The repartitioning is a
>> > little confusing for me.
>> > You wrote: "in the complete method of your Aggregator, only the tuples in
>> > the current partition will have been aggregated"
>> > However, the second link you added mentions that "aggregate is
>> > run on each batch of the stream in isolation".
>> > So I am confused.
>> > Doesn't the complete method run, as mentioned in the link, on all
>> > batches? Or does it run on all the parts of batches that run on the same
>> > partition?
>> >
>> > Thanks,
>> >
>> > From: Laurent Thoulon [mailto:[email protected]]
>> > Sent: Tuesday, January 21, 2014 12:13 PM
>> > To: [email protected]
>> > Subject: Re: batch and partition - differences
>> >
>> > To understand this, if you haven't already read them, I'd advise you to
>> > take a look at
>> > https://github.com/nathanmarz/storm/wiki/Trident-tutorial
>> > then
>> > https://github.com/nathanmarz/storm/wiki/Trident-API-Overview
>> >
>> > To summarize this and reuse the words of Nathan, you will have, for
>> > instance, 100 tuples in one batch. Then partitioning kicks in and you may
>> > have one partition with 30 tuples, another with 50 and a last one with 20,
>> > which may be processed in parallel.
>> >
>> > If you use .partitionBy, you will define the "key fields" of your
>> > partitioning. For instance, partitionBy(new Fields("id")) will ensure
>> > that all the tuples from the batch that have a matching "id" field are
>> > grouped in the same partition.
>> > Aggregation and groupBy use that mechanism, as it's necessary to have
>> > the whole set of tuples that you want to aggregate together.
>> >
>> > That being said, you should understand that in the complete method of
>> > your Aggregator, only the tuples in the current partition will have been
>> > aggregated. You can still use .global() or .batchGlobal() to aggregate
>> > all the tuples of the batch into one partition.
>> >
>> > Laurent
>> >
>> > From: "Michal Singer" <[email protected]>
>> > To: [email protected]
>> > Sent: Tuesday, 21 January 2014 08:37:59
>> > Subject: RE: batch and partition - differences
>> >
>> > So a batch can be divided into multiple partitions? And then, for
>> > example, an aggregator will aggregate all the tuples in the batch in the
>> > complete method?
>> > Thanks
>> >
>> > From: [email protected] [mailto:[email protected]] On Behalf Of
>> > Nathan Marz
>> > Sent: Tuesday, January 21, 2014 7:06 AM
>> > To: [email protected]
>> > Subject: Re: batch and partition - differences
>> >
>> > A batch is all the tuples being computed on at once in each run of the
>> > topology. Each stage of the processing is split into partitions for
>> > parallelization.
>> >
>> > On Mon, Jan 20, 2014 at 3:31 AM, Michal Singer <[email protected]>
>> > wrote:
>> >
>> > Hi, it is not clear what the difference is between batch and partition
>> > in Trident.
>> > Is the partition the task that the batch is performed on?
>> > Can someone explain the difference?
>> > Thanks
>> >
>> > --
>> > Twitter: @nathanmarz
>> > http://nathanmarz.com
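
For anyone reading this thread later, the aggregation scopes discussed above (whole batch, per partition, per group) can be imitated without a cluster. The sketch below is plain Java and does not use the Trident API: the class `BatchPartitionSketch`, its `Tuple` type, and every method name are hypothetical, chosen only for illustration. It assumes that partitioning by a key field amounts to hashing the key and taking it modulo the partition count, and that grouping happens inside each partition afterwards. It is a simulation of the semantics, not Storm's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation; none of these names come from the Storm API.
class BatchPartitionSketch {

    // Hypothetical tuple with a key field "id" and a numeric "value" field.
    static final class Tuple {
        final String id;
        final int value;

        Tuple(String id, int value) {
            this.id = id;
            this.value = value;
        }
    }

    // Imitates partitionBy(new Fields("id")): tuples with the same id
    // always land in the same partition (hash of the key, mod partition count).
    static List<List<Tuple>> partitionById(List<Tuple> batch, int numPartitions) {
        List<List<Tuple>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Tuple t : batch) {
            int index = Math.floorMod(t.id.hashCode(), numPartitions);
            partitions.get(index).add(t);
        }
        return partitions;
    }

    static int sum(List<Tuple> tuples) {
        int total = 0;
        for (Tuple t : tuples) {
            total += t.value;
        }
        return total;
    }

    // Imitates aggregate on a plain stream: one result for the whole batch.
    static int aggregateBatch(List<Tuple> batch) {
        return sum(batch);
    }

    // Imitates partitionAggregate: one result per partition of the batch.
    static List<Integer> aggregatePerPartition(List<List<Tuple>> partitions) {
        List<Integer> results = new ArrayList<>();
        for (List<Tuple> partition : partitions) {
            results.add(sum(partition));
        }
        return results;
    }

    // Imitates groupBy followed by aggregate: one result per distinct id,
    // computed inside whichever partition holds that id.
    static Map<String, Integer> aggregatePerGroup(List<List<Tuple>> partitions) {
        Map<String, Integer> results = new TreeMap<>();
        for (List<Tuple> partition : partitions) {
            for (Tuple t : partition) {
                results.merge(t.id, t.value, Integer::sum);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Tuple> batch = Arrays.asList(
                new Tuple("a", 1), new Tuple("b", 2), new Tuple("a", 3),
                new Tuple("c", 4), new Tuple("b", 5));
        List<List<Tuple>> partitions = partitionById(batch, 2);

        System.out.println(aggregateBatch(batch));             // 15
        System.out.println(aggregatePerPartition(partitions)); // [7, 8]
        System.out.println(aggregatePerGroup(partitions));     // {a=4, b=7, c=4}
    }
}
```

With two partitions, ids "a" and "c" happen to share one partition while "b" sits alone in the other. This illustrates the point raised in the thread: in this model a single partition can hold several groups, and grouping narrows the aggregation further than partitioning alone.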
