Hi Michal 

Indeed. To quote the wiki:
Running aggregate on a Stream does a global aggregation. 
If you run aggregators on a grouped stream, the aggregation will be run within 
each group instead of against the whole batch. 
partitionAggregate runs a function on each partition of a batch of tuples. 

So, in the complete method of your Aggregator, depending on how you set up
your stream, my understanding is that you will:
- Aggregate all the tuples of the batch if you just used .aggregate() on a stream
- Aggregate the tuples by group if you used .groupBy().aggregate()
- Aggregate the tuples of the partition defined by Storm if you used
partitionAggregate()
- Aggregate the tuples of the partition you defined if you used
.partitionBy().partitionAggregate()
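
To make the cases above concrete, here is a minimal plain-Java sketch that models a single batch and counts its tuples at each scope. It does not use Storm at all: the class name, the method names and the hash-modulo routing rule are assumptions made up for illustration, and the comments only point at the Trident call each method is meant to mimic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of one Trident batch. None of these names come from Storm.
public class AggregationScopes {

    // Hypothetical routing rule standing in for partitionBy(new Fields("id")).
    static int partitionOf(String id, int numPartitions) {
        return Math.floorMod(id.hashCode(), numPartitions);
    }

    // Models .aggregate() on a stream: one count for the whole batch.
    static long countBatch(List<String> batchIds) {
        return batchIds.size();
    }

    // Models .groupBy().aggregate(): one count per distinct id.
    static Map<String, Long> countByGroup(List<String> batchIds) {
        Map<String, Long> counts = new TreeMap<>();
        for (String id : batchIds) {
            counts.merge(id, 1L, Long::sum);
        }
        return counts;
    }

    // Models .partitionBy().partitionAggregate(): one count per partition.
    static long[] countByPartition(List<String> batchIds, int numPartitions) {
        long[] counts = new long[numPartitions];
        for (String id : batchIds) {
            counts[partitionOf(id, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Each string is the "id" field of one tuple in the batch.
        List<String> batchIds = Arrays.asList("a", "b", "a", "c", "a", "b");
        System.out.println("whole batch:   " + countBatch(batchIds));
        System.out.println("per group:     " + countByGroup(batchIds));
        System.out.println("per partition: "
                + Arrays.toString(countByPartition(batchIds, 2)));
    }
}
```

Note that a partition can hold several groups at once, so the per-partition counts are not the same thing as the per-group counts. A plain partitionAggregate() without partitionBy() would look like countByPartition with whatever routing Storm happened to apply upstream.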


Laurent 
----- Original message -----

From: "Michal Singer" <[email protected]> 
To: [email protected] 
Sent: Tuesday, January 21, 2014 11:47:05 
Subject: RE: batch and partition - differences 



Actually, isn’t the difference between aggregate and partitionAggregate that 
aggregate runs on the whole batch and partitionAggregate on each partition of 
the batch? 
thanks 



From: Michal Singer [mailto: [email protected] ] 
Sent: Tuesday, January 21, 2014 12:44 PM 
To: ' [email protected] ' 
Subject: RE: batch and partition - differences 

Thanks for the answer. 
I read the tutorials but I am still confused. Repartitioning is a little 
confusing for me. 
You wrote: “in the complete method of your Aggregator, only the tuples in the 
current partition will have been aggregated” 
However, the second link you added says: “aggregate is run 
on each batch of the stream in isolation,” 
So I am confused. 
Doesn’t the complete method run as mentioned in the link on all batches? Or 
does it run on all parts of batches that run on the same partition? 

Thanks, 




From: Laurent Thoulon [ mailto:[email protected] ] 
Sent: Tuesday, January 21, 2014 12:13 PM 
To: [email protected] 
Subject: Re: batch and partition - differences 


To understand this, if you haven't already read them, I'd advise you to take a 
look at 
https://github.com/nathanmarz/storm/wiki/Trident-tutorial 
then 
https://github.com/nathanmarz/storm/wiki/Trident-API-Overview 

To summarize this and reuse Nathan's words: you will have, for instance, 100 
tuples in one batch. Then partitioning kicks in and you may have one partition 
with 30 tuples, another with 50 and a last one with 20, which may be processed 
in parallel. 

If you use .partitionBy, you define the "key fields" of your partitioning. 
For instance, partitionBy(new Fields("id")) will ensure that all the tuples from 
the batch that have the same "id" value end up in the same partition. 
Aggregation and groupBy use that mechanism, as it's necessary to have the whole 
set of tuples that you want to aggregate together in one place. 
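
As a rough illustration of that guarantee, here is a small plain-Java sketch that splits the "id" values of one batch into partitions by hashing the key, then checks that no id ends up in two partitions. It is only a model: the class and method names are hypothetical, and the exact hash used here is an assumption, not Storm's own implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of key-based partitioning. Not Storm code.
public class PartitionBySketch {

    // Assumed rule: hash of the key field, modulo the number of partitions.
    static int partitionOf(String id, int numPartitions) {
        return Math.floorMod(id.hashCode(), numPartitions);
    }

    // Splits the ids of one batch into numPartitions lists.
    static List<List<String>> partitionBy(List<String> batchIds, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String id : batchIds) {
            partitions.get(partitionOf(id, numPartitions)).add(id);
        }
        return partitions;
    }

    // Returns true if no id appears in more than one partition.
    static boolean keysAreColocated(List<List<String>> partitions) {
        Set<String> seen = new HashSet<>();
        for (List<String> partition : partitions) {
            // Deduplicate within the partition, then check across partitions.
            for (String id : new HashSet<>(partition)) {
                if (!seen.add(id)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> batchIds = Arrays.asList("u1", "u2", "u1", "u3", "u2", "u1");
        List<List<String>> partitions = partitionBy(batchIds, 3);
        System.out.println("partitions: " + partitions);
        System.out.println("same id always together: " + keysAreColocated(partitions));
    }
}
```

Because the target partition depends only on the key, every tuple with a given id takes the same route, which is why an aggregation that follows can see the complete set of tuples for that id.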

That being said, you should understand that in the complete method of your 
Aggregator, only the tuples in the current partition will have been aggregated. 
You can still use .global() or .batchGlobal() to send all the tuples of 
the batch to a single partition before aggregating. 


Laurent 



From: "Michal Singer" < [email protected] > 
To: [email protected] 
Sent: Tuesday, January 21, 2014 08:37:59 
Subject: RE: batch and partition - differences 
So a batch can be divided into multiple partitions? And then, for example, an 
aggregator will aggregate all the tuples in the batch in the complete method? 
thanks 

From: [email protected] [mailto: [email protected] ] On Behalf Of 
Nathan Marz 
Sent: Tuesday, January 21, 2014 7:06 AM 
To: [email protected] 
Subject: Re: batch and partition - differences 


A batch is all the tuples being computed on at once each run of the topology. 
Each stage of the processing is split into partitions for parallelization. 



On Mon, Jan 20, 2014 at 3:31 AM, Michal Singer < [email protected] > wrote: 


Hi, it is not clear what the difference is between a batch and a partition in 
Trident. 
Is partition the task that the batch is performed on? 
Can someone explain the difference? 
thanks 





-- 


Twitter: @nathanmarz 
http://nathanmarz.com 
