Thanks for all the help. :)


On Wed, Sep 21, 2016 at 11:56 AM, Harsh Choudhary <shry.ha...@gmail.com>
wrote:

> It is real-time. I get streaming JSONs from Kafka.
>
>
>
>
> On Wed, Sep 21, 2016 at 4:15 AM, Ambud Sharma <asharma52...@gmail.com>
> wrote:
>
>> Is this real-time or batch?
>>
>> If batch this is perfect for MapReduce or Spark.
>>
>> If real-time then you should use Spark or Storm Trident.
>>
>> On Sep 20, 2016 9:39 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>>
>>> My use case is that I have a JSON which contains an array. I need to
>>> split that array into multiple JSONs and do some computations on them.
>>> After that, the results from each JSON have to be combined in a further
>>> calculation to produce the final result.
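The split / compute / combine flow described above can be sketched in plain Python (outside Storm) to make the data flow concrete; the field names and the per-item computation here are invented for illustration:

```python
import json

# A hypothetical incoming message: one JSON containing an array of items.
raw = '{"batch_id": "b1", "items": [{"value": 2}, {"value": 3}, {"value": 5}]}'
msg = json.loads(raw)

# Split: one small JSON per array element (what the splitting stage would emit).
parts = [{"batch_id": msg["batch_id"], "value": item["value"]}
         for item in msg["items"]]

# Per-part computation -- squaring is just a stand-in for the real work.
results = [p["value"] ** 2 for p in parts]

# Final cumulative calculation over all the part results.
final = sum(results)
print(final)  # 4 + 9 + 25 = 38
```

The hard part in a distributed topology is not this arithmetic but knowing when all parts of one original JSON have been processed, which is what the rest of the thread discusses.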
>>>
>>> *Cheers!*
>>>
>>> Harsh Choudhary / Software Engineer
>>>
>>> Blog / express.harshti.me
>>>
>>>
>>> On Tue, Sep 20, 2016 at 9:09 PM, Ambud Sharma <asharma52...@gmail.com>
>>> wrote:
>>>
>>>> What's your use case?
>>>>
>>>> The complexities can be managed as long as your tuple branching is
>>>> reasonable, i.e. one tuple creates several other tuples and you need to
>>>> sync the results between them.
>>>>
>>>> On Sep 20, 2016 8:19 AM, "Harsh Choudhary" <shry.ha...@gmail.com>
>>>> wrote:
>>>>
>>>>> You're right. For that, I would have to manage the queue and all the
>>>>> complexities of timeouts. If Storm is not the right place to do this,
>>>>> then what is?
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 20, 2016 at 8:25 PM, Ambud Sharma <asharma52...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The correct way is to perform time-window aggregation using bucketing.
>>>>>>
>>>>>> Use the timestamp on your event, computed from the various stages, and
>>>>>> send it to a single bolt where the aggregation happens. You only emit
>>>>>> from this bolt once you receive results from both parts.
>>>>>>
>>>>>> It's like creating a barrier, or the join phase of a fork-join pool.
>>>>>>
>>>>>> That said, the more important question is: is Storm the right place to
>>>>>> do this? When you perform time-window aggregation you are susceptible
>>>>>> to tuple timeouts, and you also have to make sure your aggregation is
>>>>>> idempotent.
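The barrier idea above can be sketched in a few lines of Python (this is not the Storm API; the class and field names are invented): partial results are bucketed by event id, and the combined result is released only once all expected parts have arrived.

```python
from collections import defaultdict

class JoinBarrier:
    """Buffers partial results per event id; releases only when complete."""

    def __init__(self, expected_parts):
        self.expected = expected_parts
        self.buckets = defaultdict(list)

    def add(self, event_id, partial):
        """Return the combined list once all parts have arrived, else None."""
        bucket = self.buckets[event_id]
        bucket.append(partial)
        if len(bucket) == self.expected:
            return self.buckets.pop(event_id)  # barrier released
        return None

barrier = JoinBarrier(expected_parts=2)
assert barrier.add("evt-1", 10) is None   # first part: still waiting
print(barrier.add("evt-1", 32))           # second part arrives: [10, 32]
```

In a real bolt this buffer is exactly what is exposed to tuple timeouts: if one part never arrives, the bucket must eventually be evicted, and a replayed tuple must not be counted twice (the idempotency concern above).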
>>>>>>
>>>>>> On Sep 20, 2016 7:49 AM, "Harsh Choudhary" <shry.ha...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> But how would that solve the syncing problem?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 20, 2016 at 8:12 PM, Alberto São Marcos <
>>>>>>> alberto....@gmail.com> wrote:
>>>>>>>
>>>>>>>> I would dump the *Bolt-A* results in a shared-data-store/queue and
>>>>>>>> have a separate workflow with another spout and Bolt-B draining from 
>>>>>>>> there
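That shared-store approach can be modeled with an in-process queue standing in for the external store (in practice this would be Kafka, Redis, a database, etc.; the queue here is only an illustration):

```python
import queue

shared = queue.Queue()  # stands in for an external shared store/queue

# Workflow 1: the first bolt dumps its per-part results.
for part_result in [7, 11, 13]:
    shared.put(("json-1", part_result))

# Workflow 2: a second spout drains the store and a second bolt aggregates.
collected = []
while not shared.empty():
    parent_id, value = shared.get()
    collected.append(value)
print(sum(collected))  # 31
```

The appeal of this design is that the aggregation state lives outside the topology, so it survives worker placement and restarts; the cost is an extra hop and an external dependency.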
>>>>>>>>
>>>>>>>> On Tue, Sep 20, 2016 at 9:20 AM, Harsh Choudhary <
>>>>>>>> shry.ha...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> I am thinking of doing the following.
>>>>>>>>>
>>>>>>>>> A spout subscribes to Kafka and gets JSONs. The spout emits the
>>>>>>>>> JSONs as individual tuples.
>>>>>>>>>
>>>>>>>>> Bolt-A subscribes to the spout. Bolt-A creates multiple JSONs from
>>>>>>>>> each JSON and emits them as multiple streams.
>>>>>>>>>
>>>>>>>>> Bolt-B receives these streams and does the computations on them.
>>>>>>>>>
>>>>>>>>> I need to build a cumulative result from all the multiple JSONs
>>>>>>>>> (which emerged from a single JSON) in a bolt. But a bolt's static
>>>>>>>>> instance variable is only shared between tasks per worker. How do I
>>>>>>>>> achieve this syncing?
>>>>>>>>>
>>>>>>>>>                               --->
>>>>>>>>> Spout ---> Bolt-A   --->   Bolt-B  ---> Final result
>>>>>>>>>                               --->
>>>>>>>>>
>>>>>>>>> The final result is per JSON which was read from Kafka.
>>>>>>>>>
>>>>>>>>> Or is there any other way to achieve this better?
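One way to let the final bolt know when it has seen every piece, without relying on shared static state, is to have the splitting bolt tag each piece with its parent's id and the total split count; an aggregator keyed on the parent id then works per original JSON. A rough Python model of that bookkeeping (hypothetical field names, no Storm dependencies):

```python
from collections import defaultdict

def split(parent_id, items):
    """Splitting stage: tag every piece with the parent id and total count."""
    total = len(items)
    return [{"parent": parent_id, "total": total, "value": v} for v in items]

class Aggregator:
    """Final stage: collects per-parent results, emits when the count is hit."""

    def __init__(self):
        self.partial = defaultdict(list)

    def receive(self, piece, computed):
        key = piece["parent"]
        self.partial[key].append(computed)
        if len(self.partial[key]) == piece["total"]:
            return key, sum(self.partial.pop(key))  # cumulative result
        return None

agg = Aggregator()
out = None
for piece in split("json-42", [1, 2, 3]):
    out = agg.receive(piece, piece["value"] * 10)  # stand-in computation
print(out)  # ('json-42', 60)
```

In Storm terms, grouping the pieces by the parent id (a fields grouping) would send all pieces of one JSON to the same aggregator task, so plain instance state suffices and no cross-worker sharing is needed.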
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>
