Looks like this article https://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/ gave me some answers.
On Wed, Apr 1, 2015 at 5:15 PM, Olivier Mallassi <[email protected]> wrote:
> Hi all,
>
> I would like to better understand how groupBy/persistentAggregate works in
> a distributed environment.
>
> I have the following topology (just an extract):
>
>     Stream currStream = topology.newStream(this.getTopologyName(), jmsspout)
>         ...
>     currStream.groupBy(new Fields(A, B))
>         .persistentAggregate(myDistributedCacheStateFactory(hosts),
>             new Fields(C),
>             (CombinerAggregator) sum,
>             new Fields("output"))
>         .parallelismHint(parallHintValue);
>
> MyDistributedCacheStateFactory is an implementation of MapState (which
> should work because of the groupBy) that persists into a distributed cache.
> To keep it simple, an entry is identified by the value of [A, B].
>
> So, as far as I understand it, my batch will be grouped by [A, B] and my
> CombinerAggregator will sum all the C objects.
>
> 1/ Correct me if I am wrong, but in a distributed environment I will have
> several instances of MyDistributedCacheStateFactory executing in different
> JVMs or threads.
>
> 2/ Does Storm guarantee that each group will always go to the same thread,
> or not? If not, how can I ensure a "select for update where key=[A,B]"?
> How can I ensure I will not update the same entry of the distributed cache
> from two different threads (i.e. a kind of race condition)?
>
> 3/ Is it a better option to use groupBy().aggregate().partitionPersist()?
>
> Hope my questions are not too blurry.
>
> Regards.
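On question 2, the key point is that a fields grouping partitions by a hash of the grouping fields, so all tuples with the same [A, B] are routed to the same task. Here is a minimal, stdlib-only sketch of that routing idea; the class and method names are illustrative and do not reflect Storm's actual internals:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of how a fields grouping routes tuples.
// The grouping fields are hashed and reduced modulo the number of
// target tasks, so every tuple carrying the same [A, B] values is
// always delivered to the same task (and hence the same thread).
public class FieldsGroupingSketch {
    static int chooseTask(List<Object> groupingValues, int numTasks) {
        // Equal keys produce equal hashes, so the chosen task index
        // is deterministic for a given key and task count.
        return Math.abs(groupingValues.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        int t1 = chooseTask(Arrays.asList("A1", "B1"), numTasks);
        int t2 = chooseTask(Arrays.asList("A1", "B1"), numTasks);
        System.out.println(t1 == t2); // prints "true": same key, same task
    }
}
```

Under this scheme, a given [A, B] entry is only ever updated from one task, which is what lets a MapState behind persistentAggregate avoid the race condition described above, as long as no other topology writes to the same keys.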
