Re: batch and partition - differences

Laurent Thoulon Wed, 22 Jan 2014 07:43:17 -0800

I'd advise you to take a look at this part of the wiki, which should answer 
your question: 
https://github.com/nathanmarz/storm/wiki/Trident-API-Overview#operations-on-grouped-streams
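
As I read that wiki section, groupBy first repartitions the stream on the grouping fields and then groups tuples with equal field values inside each partition, so one partition can hold several groups. Below is a minimal plain-Java sketch of that idea. It does not use any Trident classes: the class name, the method names and the hash-based partition choice are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model only; none of these names come from Storm or Trident.
final class GroupVsPartition {
    // Hypothetical stand-in for hash partitioning on the grouping field.
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // For one batch, returns: partition index -> (group key -> tuple count).
    static Map<Integer, Map<String, Integer>> groupCounts(List<String> batch, int numPartitions) {
        Map<Integer, Map<String, Integer>> result = new TreeMap<>();
        for (String key : batch) {
            result.computeIfAbsent(partitionOf(key, numPartitions), p -> new TreeMap<>())
                  .merge(key, 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Six tuples, three distinct keys, two partitions:
        // at least one partition must contain more than one group.
        List<String> batch = Arrays.asList("a", "b", "c", "a", "b", "a");
        System.out.println(groupCounts(batch, 2));
    }
}
```

Every key lands in exactly one partition, but with more distinct keys than partitions, some partition necessarily carries more than one group.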



Laurent 
----- Original Message -----

From: "churly lin" <[email protected]> 
To: "user" <[email protected]> 
Sent: Wednesday, 22 January 2014 11:15:38 
Subject: Re: batch and partition - differences 


+1. 
I have the same question after reading Laurent's nice answer. 



2014/1/22 Susheel Kumar Gadalay < [email protected] > 


Here, what is the difference between partition and group by? 
Can a partition contain multiple groups? 


On 1/22/14, Laurent Thoulon < [email protected] > wrote: 
> Hi Michal 
> 
> Indeed, I'll quote the wiki: 
> Running aggregate on a Stream does a global aggregation. 
> If you run aggregators on a grouped stream, the aggregation will be run 
> within each group instead of against the whole batch. 
> partitionAggregate runs a function on each partition of a batch of tuples. 
> 
> So, in the complete method of your Aggregator, depending on how you 
> set up your stream, I understand you will: 
> - Aggregate all the tuples of the batch if you just used .aggregate() on a 
> stream 
> - Aggregate the tuples by group if you used .groupBy().aggregate() 
> - Aggregate the tuples of the partition defined by Storm if you used 
> partitionAggregate() 
> - Aggregate the tuples of the partition you defined if you used 
> .partitionBy().partitionAggregate() 
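
The four cases above can be sketched with a counting aggregator in plain Java. This is only a model of the scopes, not Trident code: a batch is represented as a list of partitions, each tuple is reduced to its "id" value, and the class, the method names and the hash-based repartitioning are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model only; none of these names come from Storm or Trident.
final class AggregationScopes {
    // Models .aggregate(): a single count for the whole batch.
    static int wholeBatch(List<List<String>> partitions) {
        int n = 0;
        for (List<String> p : partitions) n += p.size();
        return n;
    }

    // Models .groupBy("id").aggregate(): one count per distinct id in the batch.
    static Map<String, Integer> perGroup(List<List<String>> partitions) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> p : partitions)
            for (String id : p) counts.merge(id, 1, Integer::sum);
        return counts;
    }

    // Models .partitionAggregate(): one count per partition, as the batch arrived.
    static List<Integer> perPartition(List<List<String>> partitions) {
        List<Integer> counts = new ArrayList<>();
        for (List<String> p : partitions) counts.add(p.size());
        return counts;
    }

    // Models .partitionBy("id"): re-split so equal ids share a partition
    // (the hash function is a hypothetical choice).
    static List<List<String>> partitionById(List<List<String>> partitions, int n) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) out.add(new ArrayList<>());
        for (List<String> p : partitions)
            for (String id : p) out.get(Math.floorMod(id.hashCode(), n)).add(id);
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> batch = Arrays.asList(
            Arrays.asList("a", "b", "a"), Arrays.asList("b", "c"));
        System.out.println(wholeBatch(batch));    // 5
        System.out.println(perGroup(batch));      // {a=2, b=2, c=1}
        System.out.println(perPartition(batch));  // [3, 2]
        // After repartitioning by id, the per-partition counts follow the ids.
        System.out.println(perPartition(partitionById(batch, 2)));
    }
}
```

The same five tuples give one result, three results or one result per partition depending only on the scope the aggregator runs in.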
> 
> 
> Laurent 
> ----- Original Message ----- 


> 
> From: "Michal Singer" < [email protected] > 
> To: [email protected] 
> Sent: Tuesday, 21 January 2014 11:47:05 
> Subject: RE: batch and partition - differences 
> 
> 
> 
> Actually, isn’t the difference between aggregate and partitionAggregate that 
> aggregate runs on the whole batch and partitionAggregate on each partition of 
> the batch? 
> thanks 
> 
> 
> 
> From: Michal Singer [mailto: [email protected] ] 
> Sent: Tuesday, January 21, 2014 12:44 PM 
> To: ' [email protected] ' 
> Subject: RE: batch and partition - differences 
> 
> Thanks for the answer. 
> I read the tutorials but I was still confused. The repartitioning is a little 
> confusing for me. 
> You wrote: “in the complete method of your Aggregator, only the tuples in 
> the current partition will have been aggregated” 
> However, in the second link you added, it is mentioned that: “aggregate is 
> run on each batch of the stream in isolation,” 
> So I am confused. 
> Doesn’t the complete method run on the whole batch, as mentioned in the link? Or 
> does it run only on the parts of a batch that are in the same partition? 
> 
> Thanks, 
> 
> 
> 
> 
> From: Laurent Thoulon [ mailto: [email protected] ] 
> Sent: Tuesday, January 21, 2014 12:13 PM 
> To: [email protected] 
> Subject: Re: batch and partition - differences 
> 
> 
> To understand this, if you haven't already read them, I'd advise you to 
> take a look at 
> https://github.com/nathanmarz/storm/wiki/Trident-tutorial 
> then 
> https://github.com/nathanmarz/storm/wiki/Trident-API-Overview 
> 
> To summarize this and reuse Nathan's words: you will have, for instance, 
> 100 tuples in one batch. Then partitioning kicks in and you may have one 
> partition with 30 tuples, another with 50 and a last one with 20, which may 
> be processed in parallel. 
> 
> If you use .partitionBy, you define the "key fields" of your 
> partitioning. For instance, partitionBy(new Fields("id")) ensures that all 
> the tuples from the batch that have the same "id" value end up in the 
> same partition. 
> Aggregation and groupBy use that mechanism, as the whole set of tuples 
> that you want to aggregate together has to be in one place. 
> 
> That being said, you should understand that in the complete method of your 
> Aggregator, only the tuples in the current partition will have been 
> aggregated. You can still use .global() or .batchGlobal() to aggregate all 
> the tuples of the batch into one partition. 
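
To make the 100-tuple example concrete, here is a small plain-Java model of it. It is a sketch under stated assumptions, not Trident code: the class and method names are hypothetical, tuples are plain integers, and toSinglePartition only imitates the effect described above of bringing every tuple of the batch into one partition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative model only; none of these names come from Storm or Trident.
final class BatchGlobalModel {
    // Builds one batch of tuples already split into partitions of the given sizes.
    static List<List<Integer>> split(int... sizes) {
        List<List<Integer>> partitions = new ArrayList<>();
        int next = 0;
        for (int size : sizes) {
            List<Integer> p = new ArrayList<>();
            for (int i = 0; i < size; i++) p.add(next++);
            partitions.add(p);
        }
        return partitions;
    }

    // Models moving every tuple of the batch into a single partition.
    static List<List<Integer>> toSinglePartition(List<List<Integer>> partitions) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> p : partitions) all.addAll(p);
        return Collections.singletonList(all);
    }

    // What a counting aggregator would report: one value per partition it ran on.
    static List<Integer> countPerPartition(List<List<Integer>> partitions) {
        List<Integer> counts = new ArrayList<>();
        for (List<Integer> p : partitions) counts.add(p.size());
        return counts;
    }

    public static void main(String[] args) {
        List<List<Integer>> batch = split(30, 50, 20);
        System.out.println(countPerPartition(batch));                    // [30, 50, 20]
        System.out.println(countPerPartition(toSinglePartition(batch))); // [100]
    }
}
```

In this model the per-partition aggregator sees only 30, 50 or 20 tuples at a time, and sees all 100 only once the batch has been collapsed into one partition.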
> 
> 
> Laurent 
> 
> 
> 


> From: "Michal Singer" < [email protected] > 
> To: [email protected] 
> Sent: Tuesday, 21 January 2014 08:37:59 
> Subject: RE: batch and partition - differences 
> So a batch can be divided into multiple partitions? And then, for example, an 
> aggregator will aggregate all the tuples in the batch in the complete 
> method? 
> thanks 
> 
> From: [email protected] [mailto: [email protected] ] On Behalf Of 
> Nathan Marz 
> Sent: Tuesday, January 21, 2014 7:06 AM 
> To: [email protected] 
> Subject: Re: batch and partition - differences 
> 
> 
> A batch is all the tuples being computed on at once each run of the 
> topology. Each stage of the processing is split into partitions for 
> parallelization. 
> 
> 
> 
> On Mon, Jan 20, 2014 at 3:31 AM, Michal Singer < [email protected] > 
> wrote: 
> 
> 
> Hi, it is not clear what the difference is between a batch and a partition in 
> Trident. 
> Is a partition the task that the batch is performed on? 
> Can someone explain the difference? 
> thanks 
> 
> 
> 
> 
> 
> -- 
> 
> 
> Twitter: @nathanmarz 
> http://nathanmarz.com 
> 
>
