My use case is that I have a JSON which contains an array. I need to split
that array into multiple JSONs and do some computations on each of them.
After that, the results from all of those JSONs have to be used together in
a further calculation to come up with the final result.
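
For the split step, here is a minimal sketch of what Bolt-A could look like
(Storm 1.x API; Jackson for JSON parsing; the "items" field name, the tuple
field names, and the random correlation id are all assumptions):

import java.util.Map;
import java.util.UUID;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitArrayBolt extends BaseRichBolt {
    private transient ObjectMapper mapper;
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.mapper = new ObjectMapper();
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            JsonNode root = mapper.readTree(tuple.getStringByField("json"));
            JsonNode items = root.get("items"); // array field name is an assumption
            String correlationId = UUID.randomUUID().toString();
            int expected = items.size();
            for (JsonNode item : items) {
                // anchor each child tuple to the input so a failure replays the whole JSON
                collector.emit(tuple, new Values(correlationId, expected, item.toString()));
            }
            collector.ack(tuple);
        } catch (Exception e) {
            collector.fail(tuple);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("correlationId", "expectedCount", "item"));
    }
}

Carrying the correlation id and the expected count on every child tuple is
what later lets a downstream bolt know when it has seen all the pieces of
one original JSON.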

*Cheers!*

Harsh Choudhary / Software Engineer

Blog / express.harshti.me


On Tue, Sep 20, 2016 at 9:09 PM, Ambud Sharma <asharma52...@gmail.com>
wrote:

> What's your use case?
>
> The complexities can be managed as long as your tuple branching is
> reasonable, i.e. one tuple creates several other tuples and you need to
> sync results between them.
>
> On Sep 20, 2016 8:19 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>
>> You're right. For that I would have to manage the queue and all the
>> complexities of timeouts. If Storm is not the right place to do this,
>> then what else is?
>>
>>
>>
>> On Tue, Sep 20, 2016 at 8:25 PM, Ambud Sharma <asharma52...@gmail.com>
>> wrote:
>>
>>> The correct way is to perform time window aggregation using bucketing.
>>>
>>> Use the timestamp on your event, computed from the various stages, and
>>> send it to a single bolt where the aggregation happens. You only emit
>>> from this bolt once you have received results from all the parts.
>>>
>>> It's like creating a barrier or the join phase of a fork join pool.
>>>
>>> That said, the more important question is: is Storm the right place to
>>> do this? When you perform time-window aggregation you are susceptible to
>>> tuple timeouts and also have to make sure your aggregation is
>>> idempotent.
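>>>
>>> As a minimal sketch of such a barrier/join bolt (Storm 1.x API; the
>>> field names and the in-memory map are assumptions, and a production
>>> version would also need to evict stale entries to cope with tuple
>>> timeouts):
>>>
>>> import java.util.ArrayList;
>>> import java.util.HashMap;
>>> import java.util.List;
>>> import java.util.Map;
>>> import org.apache.storm.task.OutputCollector;
>>> import org.apache.storm.task.TopologyContext;
>>> import org.apache.storm.topology.OutputFieldsDeclarer;
>>> import org.apache.storm.topology.base.BaseRichBolt;
>>> import org.apache.storm.tuple.Fields;
>>> import org.apache.storm.tuple.Tuple;
>>> import org.apache.storm.tuple.Values;
>>>
>>> public class JoinResultsBolt extends BaseRichBolt {
>>>     private OutputCollector collector;
>>>     // partial results per original JSON, keyed by correlation id
>>>     private Map<String, List<Object>> pending;
>>>
>>>     @Override
>>>     public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
>>>         this.collector = collector;
>>>         this.pending = new HashMap<>();
>>>     }
>>>
>>>     @Override
>>>     public void execute(Tuple tuple) {
>>>         String id = tuple.getStringByField("correlationId");
>>>         int expected = tuple.getIntegerByField("expectedCount");
>>>         List<Object> parts = pending.computeIfAbsent(id, k -> new ArrayList<>());
>>>         parts.add(tuple.getValueByField("partialResult"));
>>>         if (parts.size() == expected) {
>>>             // barrier reached: every piece of this JSON has arrived
>>>             collector.emit(new Values(id, aggregate(parts)));
>>>             pending.remove(id);
>>>         }
>>>         collector.ack(tuple);
>>>     }
>>>
>>>     private Object aggregate(List<Object> parts) {
>>>         return parts; // placeholder for the real cumulative computation
>>>     }
>>>
>>>     @Override
>>>     public void declareOutputFields(OutputFieldsDeclarer declarer) {
>>>         declarer.declare(new Fields("correlationId", "finalResult"));
>>>     }
>>> }
>>>
>>> All tuples sharing a correlationId must reach the same task for this to
>>> work, e.g. by using fieldsGrouping on that field.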
>>>
>>> On Sep 20, 2016 7:49 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>>>
>>>> But how would that solve the syncing problem?
>>>>
>>>>
>>>>
>>>> On Tue, Sep 20, 2016 at 8:12 PM, Alberto São Marcos <
>>>> alberto....@gmail.com> wrote:
>>>>
>>>>> I would dump the *Bolt-A* results in a shared data store/queue and
>>>>> have a separate workflow, with another spout and Bolt-B, draining from
>>>>> there.
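>>>>>
>>>>> A minimal sketch of that hand-off, assuming Kafka as the shared queue
>>>>> (the topic name, broker address and tuple field names are assumptions):
>>>>>
>>>>> import java.util.Map;
>>>>> import java.util.Properties;
>>>>> import org.apache.kafka.clients.producer.KafkaProducer;
>>>>> import org.apache.kafka.clients.producer.ProducerRecord;
>>>>> import org.apache.storm.task.OutputCollector;
>>>>> import org.apache.storm.task.TopologyContext;
>>>>> import org.apache.storm.topology.OutputFieldsDeclarer;
>>>>> import org.apache.storm.topology.base.BaseRichBolt;
>>>>> import org.apache.storm.tuple.Tuple;
>>>>>
>>>>> public class PublishResultsBolt extends BaseRichBolt {
>>>>>     private transient KafkaProducer<String, String> producer;
>>>>>     private OutputCollector collector;
>>>>>
>>>>>     @Override
>>>>>     public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
>>>>>         this.collector = collector;
>>>>>         Properties props = new Properties();
>>>>>         props.put("bootstrap.servers", "localhost:9092"); // assumption
>>>>>         props.put("key.serializer",
>>>>>                 "org.apache.kafka.common.serialization.StringSerializer");
>>>>>         props.put("value.serializer",
>>>>>                 "org.apache.kafka.common.serialization.StringSerializer");
>>>>>         this.producer = new KafkaProducer<>(props);
>>>>>     }
>>>>>
>>>>>     @Override
>>>>>     public void execute(Tuple tuple) {
>>>>>         // key by correlation id so all parts of one JSON stay together
>>>>>         producer.send(new ProducerRecord<>("partial-results",
>>>>>                 tuple.getStringByField("correlationId"),
>>>>>                 tuple.getStringByField("partialResult")));
>>>>>         collector.ack(tuple);
>>>>>     }
>>>>>
>>>>>     @Override
>>>>>     public void declareOutputFields(OutputFieldsDeclarer declarer) {
>>>>>         // terminal bolt: nothing is emitted downstream
>>>>>     }
>>>>> }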
>>>>>
>>>>> On Tue, Sep 20, 2016 at 9:20 AM, Harsh Choudhary <shry.ha...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I am thinking of doing the following.
>>>>>>
>>>>>> A spout subscribes to Kafka and gets JSONs. The spout emits the JSONs
>>>>>> as individual tuples.
>>>>>>
>>>>>> Bolt-A subscribes to the spout. Bolt-A creates multiple JSONs from
>>>>>> each JSON and emits them as multiple streams.
>>>>>>
>>>>>> Bolt-B receives these streams and does the computation on them.
>>>>>>
>>>>>> I need to build a cumulative result from all the multiple JSONs
>>>>>> (which emerged from a single JSON) in a bolt. But a bolt's static
>>>>>> instance variable is only shared between tasks within one worker. How
>>>>>> do I achieve this syncing?
>>>>>>
>>>>>>                               --->
>>>>>> Spout ---> Bolt-A   --->   Bolt-B  ---> Final result
>>>>>>                               --->
>>>>>>
>>>>>> The final result is per JSON which was read from Kafka.
>>>>>>
>>>>>> Or is there a better way to achieve this?
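>>>>>>
>>>>>> As a rough sketch, the wiring could look like this (a code fragment;
>>>>>> the bolt names are assumptions, ComputeBolt and the KafkaSpout
>>>>>> configuration are hypothetical, and fieldsGrouping on a per-JSON
>>>>>> correlation id is what routes all the pieces of one JSON to the same
>>>>>> aggregating task):
>>>>>>
>>>>>> import org.apache.storm.topology.TopologyBuilder;
>>>>>> import org.apache.storm.tuple.Fields;
>>>>>>
>>>>>> TopologyBuilder builder = new TopologyBuilder();
>>>>>> builder.setSpout("kafka-spout", kafkaSpout); // configured elsewhere
>>>>>> builder.setBolt("bolt-a", new SplitArrayBolt())
>>>>>>        .shuffleGrouping("kafka-spout");
>>>>>> builder.setBolt("bolt-b", new ComputeBolt()) // hypothetical compute bolt
>>>>>>        .shuffleGrouping("bolt-a");
>>>>>> // every piece of one JSON goes to the same aggregating task
>>>>>> builder.setBolt("final", new JoinResultsBolt())
>>>>>>        .fieldsGrouping("bolt-b", new Fields("correlationId"));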
>>>>>>
>>>>>
>>>>>
>>>>
>>
