The correct way is to perform time-window aggregation using bucketing.

Use the timestamp on your event, computed from the various stages, and send
it to a single bolt where the aggregation happens. Only emit from this bolt
once you have received results from both parts.

It's like creating a barrier, or the join phase of a fork-join pool.
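
Roughly, a sketch of such an aggregator bolt (Storm 1.x API). The
"correlationId", "expectedCount" and "result" fields are my assumptions
about what Bolt-A and Bolt-B put on the tuples, not something Storm gives
you:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AggregatorBolt extends BaseRichBolt {
    private OutputCollector collector;
    // One bucket per parent JSON. A plain field is enough because the
    // fieldsGrouping below sends every fragment of a given parent to the
    // same task; no cross-task sharing is needed.
    private Map<String, List<Object>> buckets;

    @Override
    public void prepare(Map conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
        this.buckets = new HashMap<>();
    }

    @Override
    public void execute(Tuple tuple) {
        String id = tuple.getStringByField("correlationId");
        int expected = tuple.getIntegerByField("expectedCount");

        List<Object> parts = buckets.computeIfAbsent(id, k -> new ArrayList<>());
        parts.add(tuple.getValueByField("result"));

        if (parts.size() == expected) {
            // Barrier reached: every fragment of this parent JSON arrived.
            collector.emit(tuple, new Values(id, parts));
            buckets.remove(id);
        }
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("correlationId", "cumulativeResult"));
    }
}

Given a TopologyBuilder, wire it with a fields grouping so all fragments of
one parent JSON land on the same aggregator task (component ids made up):

builder.setBolt("aggregator", new AggregatorBolt(), 4)
       .fieldsGrouping("bolt-b", new Fields("correlationId"));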

That said, the more important question is: is Storm the right place to do
this? When you perform time-window aggregation you are susceptible to tuple
timeouts, and you also have to make sure your aggregation is idempotent.
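
If you do stay in Storm, set the tuple timeout (Config#setMessageTimeoutSecs,
i.e. topology.message.timeout.secs) comfortably above the longest a bucket
may wait for its last fragment, and make the bucket update idempotent so a
replayed fragment is not counted twice. One way, assuming Bolt-A also tags
each fragment with a unique "fragmentIndex" (a hypothetical field), is to
key fragments by that index so a replay overwrites instead of appending:

// Inside the aggregator: an idempotent bucket update.
private final Map<String, Map<Integer, Object>> buckets = new HashMap<>();

// Returns true once every fragment of this parent JSON has arrived.
private boolean addFragment(Tuple tuple) {
    String id = tuple.getStringByField("correlationId");
    Map<Integer, Object> bucket =
        buckets.computeIfAbsent(id, k -> new HashMap<>());
    // put() by fragment index: a replayed tuple lands in its old slot
    // instead of inflating the count.
    bucket.put(tuple.getIntegerByField("fragmentIndex"),
               tuple.getValueByField("result"));
    return bucket.size() == tuple.getIntegerByField("expectedCount");
}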

On Sep 20, 2016 7:49 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:

> But how would that solve the syncing problem?
>
>
>
> On Tue, Sep 20, 2016 at 8:12 PM, Alberto São Marcos <alberto....@gmail.com> wrote:
>
>> I would dump the *Bolt-A* results in a shared data store/queue and have
>> a separate workflow with another spout and Bolt-B draining from there
>>
>> On Tue, Sep 20, 2016 at 9:20 AM, Harsh Choudhary <shry.ha...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> I am thinking of doing the following.
>>>
>>> A spout subscribes to Kafka and gets JSONs. The spout emits each JSON as
>>> an individual tuple.
>>>
>>> Bolt-A subscribes to the spout. Bolt-A creates multiple JSONs from each
>>> incoming JSON and emits them as multiple streams.
>>>
>>> Bolt-B receives these streams and does the computation on them.
>>>
>>> I need to build, in a bolt, a cumulative result from all the multiple
>>> JSONs which emerged from a single JSON. But a bolt's static instance
>>> variable is only shared between tasks within one worker. How do I
>>> achieve this syncing?
>>>
>>>                               --->
>>> Spout ---> Bolt-A   --->   Bolt-B  ---> Final result
>>>                               --->
>>>
>>> The final result is per JSON that was read from Kafka.
>>>
>>> Or is there a better way to achieve this?
>>>
>>
>>
>
