Is this real-time or batch? If batch, this is perfect for MapReduce or Spark.
If real-time, then you should use Spark or Storm Trident.

On Sep 20, 2016 9:39 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:

> My use case is that I have a JSON which contains an array. I need to split
> that array into multiple JSONs and do some computations on them. After
> that, the results from each JSON have to be used together in a further
> calculation to come up with the final result.
>
> Cheers!
>
> Harsh Choudhary / Software Engineer
> Blog / express.harshti.me
>
> On Tue, Sep 20, 2016 at 9:09 PM, Ambud Sharma <asharma52...@gmail.com> wrote:
>
>> What's your use case?
>>
>> The complexities can be managed as long as your tuple branching is
>> reasonable, i.e. one tuple creates several other tuples and you need to
>> sync results between them.
>>
>> On Sep 20, 2016 8:19 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>>
>>> You're right. For that I have to manage the queue and all those
>>> complexities of timeouts. If Storm is not the right place to do this,
>>> then what else is?
>>>
>>> On Tue, Sep 20, 2016 at 8:25 PM, Ambud Sharma <asharma52...@gmail.com> wrote:
>>>
>>>> The correct way is to perform time-window aggregation using bucketing.
>>>>
>>>> Use the timestamp on your event, computed from the various stages, and
>>>> send it to a single bolt where the aggregation happens. You only emit
>>>> from this bolt once you receive results from both parts.
>>>>
>>>> It's like creating a barrier, or the join phase of a fork-join pool.
>>>>
>>>> That said, the more important question is: is Storm the right place to
>>>> do this? When you perform time-window aggregation you are susceptible
>>>> to tuple timeouts and also have to make sure your aggregation is
>>>> idempotent.
>>>>
>>>> On Sep 20, 2016 7:49 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>>>>
>>>>> But how would that solve the syncing problem?
>>>>>
>>>>> On Tue, Sep 20, 2016 at 8:12 PM, Alberto São Marcos <alberto....@gmail.com> wrote:
>>>>>
>>>>>> I would dump the Bolt-A results in a shared data store/queue and
>>>>>> have a separate workflow, with another spout and Bolt-B draining
>>>>>> from there.
>>>>>>
>>>>>> On Tue, Sep 20, 2016 at 9:20 AM, Harsh Choudhary <shry.ha...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I am thinking of doing the following.
>>>>>>>
>>>>>>> A spout subscribes to Kafka and gets JSONs. The spout emits the
>>>>>>> JSONs as individual tuples.
>>>>>>>
>>>>>>> Bolt-A subscribes to the spout. Bolt-A creates multiple JSONs from
>>>>>>> each JSON and emits them as multiple streams.
>>>>>>>
>>>>>>> Bolt-B receives these streams and does the computation on them.
>>>>>>>
>>>>>>> I need to build a cumulative result, in a bolt, from all the
>>>>>>> multiple JSONs which emerged from a single JSON. But a bolt's
>>>>>>> static instance variable is only shared between tasks per worker.
>>>>>>> How do I achieve this syncing?
>>>>>>>
>>>>>>>           --->
>>>>>>> Spout ---> Bolt-A ---> Bolt-B ---> Final result
>>>>>>>           --->
>>>>>>>
>>>>>>> The final result is per JSON which was read from Kafka.
>>>>>>>
>>>>>>> Or is there any other way to achieve this better?
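The "barrier" aggregation suggested in the thread (bucket partial results by the parent JSON's id, emit only once every part has arrived) can be sketched in plain Java. This is a minimal illustration of the join logic only, not Storm API: the class and method names are hypothetical, the part count is assumed to travel with each split tuple so the aggregator knows when the barrier is complete, and the combine step is a simple sum standing in for the real computation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the barrier/join aggregation (not Storm API).
// In a real topology this logic would live in the single aggregation
// bolt's execute(); tuple timeouts and idempotency still need handling.
public class BarrierAggregator {
    // parentId -> partial results collected so far
    private final Map<String, List<Integer>> buckets = new HashMap<>();
    // parentId -> number of parts Bolt-A split the original JSON into
    private final Map<String, Integer> expected = new HashMap<>();
    private final List<Integer> emitted = new ArrayList<>();

    // Called once per split tuple; partCount is assumed to be attached
    // to every part when the original JSON's array is split.
    public void receive(String parentId, int partCount, int partialResult) {
        expected.putIfAbsent(parentId, partCount);
        buckets.computeIfAbsent(parentId, k -> new ArrayList<>()).add(partialResult);
        List<Integer> parts = buckets.get(parentId);
        if (parts.size() == expected.get(parentId)) {
            // Barrier complete: combine (here, a sum) and emit exactly once.
            int sum = parts.stream().mapToInt(Integer::intValue).sum();
            emitted.add(sum);
            buckets.remove(parentId);
            expected.remove(parentId);
        }
    }

    public List<Integer> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        BarrierAggregator agg = new BarrierAggregator();
        agg.receive("json-1", 3, 10);
        agg.receive("json-1", 3, 20);
        // Nothing emitted yet: only 2 of 3 parts have arrived.
        System.out.println(agg.emitted()); // prints []
        agg.receive("json-1", 3, 12);
        System.out.println(agg.emitted()); // prints [42]
    }
}
```

State held this way lives in one bolt task's memory, which is exactly why the thread warns about tuple timeouts: if a part never arrives, its bucket must eventually be expired, and a replayed tuple must not be counted twice.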