Re: batch and partition - differences

churly lin Sat, 25 Jan 2014 23:08:08 -0800

Thank you again Laurent ^ ^


2014-01-22 Laurent Thoulon <[email protected]>

> I'd advise you to take a look at this part of the wiki, which should
> answer your question:
>
> https://github.com/nathanmarz/storm/wiki/Trident-API-Overview#operations-on-grouped-streams
>
> Laurent
> ------------------------------
> *From: *"churly lin" <[email protected]>
> *To: *"user" <[email protected]>
> *Sent: *Wednesday, 22 January 2014 11:15:38
> *Subject: *Re: batch and partition - differences
>
>
> +1.
> I have the same question after reading Laurent's nice answer.
>
>
> 2014/1/22 Susheel Kumar Gadalay <[email protected]>
>
>> Here, what is the difference between partition and group by?
>> Can a partition contain multiple groups?
>>
>> On 1/22/14, Laurent Thoulon <[email protected]> wrote:
>> > Hi Michal
>> >
>> > Indeed. I'll quote the wiki:
>> > Running aggregate on a Stream does a global aggregation.
>> > If you run aggregators on a grouped stream, the aggregation will be run
>> > within each group instead of against the whole batch.
>> > partitionAggregate runs a function on each partition of a batch of
>> > tuples.
>> >
>> > So, in the complete method of your Aggregator, depending on how you
>> > set up your stream, I understand you will:
>> > - Aggregate all the tuples of the batch if you just used .aggregate on a
>> > stream
>> > - Aggregate the tuples by group if you used .groupBy().aggregate()
>> > - Aggregate the tuples of the partition defined by Storm if you used
>> > partitionAggregate()
>> > - Aggregate the tuples of the partition you defined if you used
>> > .partitionBy().partitionAggregate()
>> >
>> >
>> > Laurent
>> > ----- Original message -----
>> >
>> > From: "Michal Singer" <[email protected]>
>> > To: [email protected]
>> > Sent: Tuesday, 21 January 2014 11:47:05
>> > Subject: RE: batch and partition - differences
>> >
>> >
>> >
>> > Actually, isn’t the difference between aggregate and partitionAggregate
>> > that aggregate runs on the whole batch and partitionAggregate on each
>> > partition of the batch?
>> > thanks
>> >
>> >
>> >
>> > From: Michal Singer [mailto: [email protected] ]
>> > Sent: Tuesday, January 21, 2014 12:44 PM
>> > To: ' [email protected] '
>> > Subject: RE: batch and partition - differences
>> >
>> > Thanks for the answer.
>> > I read the tutorials but I am still confused. Repartitioning is a
>> > little confusing for me.
>> > You wrote: “in the complete method of your Aggregator, only the tuples
>> > in the current partition will have been aggregated”
>> > However, in the second link you added, it is mentioned that:
>> > “aggregate is run on each batch of the stream in isolation,”
>> > So I am confused.
>> > Doesn’t the complete method run on the whole batch, as mentioned in
>> > the link? Or does it run only on the parts of the batch that are in
>> > the same partition?
>> >
>> > Thanks,
>> >
>> >
>> >
>> >
>> > From: Laurent Thoulon [ mailto:[email protected] ]
>> > Sent: Tuesday, January 21, 2014 12:13 PM
>> > To: [email protected]
>> > Subject: Re: batch and partition - differences
>> >
>> >
>> > To understand this, if you haven't already read them, I'd advise you to
>> > take a look at
>> > https://github.com/nathanmarz/storm/wiki/Trident-tutorial
>> > then
>> > https://github.com/nathanmarz/storm/wiki/Trident-API-Overview
>> >
>> > To summarize this and reuse Nathan's words: you will have, for
>> > instance, 100 tuples in one batch. Then partitioning kicks in and you
>> > may have one partition with 30 tuples, another with 50 and a last one
>> > with 20, which may be processed in parallel.
>> >
>> > If you use .partitionBy, you define the "key fields" of your
>> > partitioning. For instance, partitionBy(new Fields("id")) will ensure
>> > that all the tuples from the batch that have the same "id" value end up
>> > in the same partition.
>> > Aggregation and groupBy use that mechanism, as it's necessary to have
>> > the whole set of tuples that you want to aggregate together in one
>> > place.
>> >
>> > That being said, you should understand that in the complete method of
>> > your Aggregator, only the tuples in the current partition will have been
>> > aggregated. You can still use .global() or .batchGlobal() to bring all
>> > the tuples of the batch into one partition before aggregating.
>> >
>> >
>> > Laurent
>> >
>> >
>> >
>> > From: "Michal Singer" < [email protected] >
>> > To: [email protected]
>> > Sent: Tuesday, 21 January 2014 08:37:59
>> > Subject: RE: batch and partition - differences
>> >
>> > So a batch can be divided into multiple partitions? And then, for
>> > example, an aggregator will aggregate all the tuples in the batch in
>> > the complete method?
>> > thanks
>> >
>> > From: [email protected] [mailto: [email protected] ] On Behalf Of
>> > Nathan Marz
>> > Sent: Tuesday, January 21, 2014 7:06 AM
>> > To: [email protected]
>> > Subject: Re: batch and partition - differences
>> >
>> >
>> > A batch is all the tuples being computed on at once each run of the
>> > topology. Each stage of the processing is split into partitions for
>> > parallelization.
>> >
>> >
>> >
>> > On Mon, Jan 20, 2014 at 3:31 AM, Michal Singer < [email protected] >
>> > wrote:
>> >
>> >
>> > Hi, it is not clear what the difference is between batch and partition
>> > in Trident.
>> > Is partition the task that the batch is performed on?
>> > Can someone explain the difference?
>> > thanks
>> >
>> >
>> >
>> >
>> >
>> > --
>> >
>> >
>> > Twitter: @nathanmarz
>> > http://nathanmarz.com
>> >
>> >
>>
>
>
>
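
For anyone landing on this thread later, the distinctions discussed in the quoted messages can be sketched in plain Java. This is only a simulation and does not use the Storm API: PartitionSketch, Tuple, sampleBatch and the helper methods are hypothetical names made up for illustration, and routing by hash of the id modulo the partition count is an assumption standing in for whatever partitionBy does internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the thread's vocabulary. Nothing here is Storm API;
// every class and method name is hypothetical.
final class PartitionSketch {

    // Stand-in for a tuple with an "id" field and a numeric payload.
    static final class Tuple {
        final String id;
        final int value;

        Tuple(String id, int value) {
            this.id = id;
            this.value = value;
        }
    }

    // One batch: all the tuples processed together in a single run.
    static List<Tuple> sampleBatch() {
        return Arrays.asList(
                new Tuple("a", 1), new Tuple("b", 2), new Tuple("a", 3),
                new Tuple("c", 4), new Tuple("b", 5));
    }

    // Imitates partitionBy(new Fields("id")): equal ids land in the same
    // partition. Hash-modulo routing is an assumption made for this sketch.
    static List<List<Tuple>> partitionById(List<Tuple> batch, int numPartitions) {
        List<List<Tuple>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Tuple t : batch) {
            partitions.get(Math.floorMod(t.id.hashCode(), numPartitions)).add(t);
        }
        return partitions;
    }

    // Imitates aggregate() on a plain stream: one sum for the whole batch.
    static int sumBatch(List<Tuple> batch) {
        int sum = 0;
        for (Tuple t : batch) {
            sum += t.value;
        }
        return sum;
    }

    // Imitates partitionAggregate(): one sum per partition of the batch.
    static List<Integer> sumPerPartition(List<List<Tuple>> partitions) {
        List<Integer> sums = new ArrayList<>();
        for (List<Tuple> partition : partitions) {
            sums.add(sumBatch(partition));
        }
        return sums;
    }

    // Imitates groupBy(new Fields("id")).aggregate(): one sum per distinct id.
    static Map<String, Integer> sumPerGroup(List<Tuple> batch) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Tuple t : batch) {
            sums.merge(t.id, t.value, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Tuple> batch = sampleBatch();
        System.out.println("whole batch:   " + sumBatch(batch));
        System.out.println("per partition: " + sumPerPartition(partitionById(batch, 2)));
        System.out.println("per group:     " + sumPerGroup(batch));
    }
}
```

With two partitions, the ids "a" and "c" happen to hash to the same partition, so the per-partition sums are [7, 8] while the per-group sums are {a=4, b=7, c=4}. In this toy model, at least, a single partition can hold several groups, but all tuples of one group always sit in the same partition.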
