Re: Syncing multiple streams to compute final result from a bolt

2016-09-23 Thread Harsh Choudhary
Thanks for all the help. :)




Re: Syncing multiple streams to compute final result from a bolt

2016-09-21 Thread Harsh Choudhary
It is real-time. I get streaming JSONs from Kafka.




Re: Syncing multiple streams to compute final result from a bolt

2016-09-20 Thread Ambud Sharma
Is this real-time or batch?

If batch, this is perfect for MapReduce or Spark.

If real-time, then you should use Spark Streaming or Storm Trident.


Re: Syncing multiple streams to compute final result from a bolt

2016-09-20 Thread Harsh Choudhary
My use case is that I have a JSON which contains an array. I need to split
that array into multiple JSONs and do some computations on them. After
that, the results from each JSON have to be combined in a further
calculation to produce the final result.

*Cheers!*

Harsh Choudhary / Software Engineer

Blog / express.harshti.me
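
The split-then-recombine step described above can be sketched independently of Storm: tag each array element with its parent message's id and the total element count, so a downstream aggregator can tell when a parent's results are complete. The class and field names here are illustrative, not from the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split one parent message's array into parts, tagging each part
// with the parent id and the total part count so the pieces can later be
// regrouped per original JSON. All names are hypothetical.
public class Splitter {
    public static class Part {
        public final String parentId;
        public final int index, total;
        public final String payload;
        Part(String parentId, int index, int total, String payload) {
            this.parentId = parentId;
            this.index = index;
            this.total = total;
            this.payload = payload;
        }
    }

    public static List<Part> split(String parentId, List<String> arrayElements) {
        List<Part> parts = new ArrayList<>();
        int total = arrayElements.size();
        for (int i = 0; i < total; i++) {
            parts.add(new Part(parentId, i, total, arrayElements.get(i)));
        }
        return parts;
    }
}
```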



Re: Syncing multiple streams to compute final result from a bolt

2016-09-20 Thread Ambud Sharma
What's your use case?

The complexities can be managed as long as your tuple branching is
reasonable, i.e. one tuple creates several other tuples and you need to sync
results between them.


Re: Syncing multiple streams to compute final result from a bolt

2016-09-20 Thread Harsh Choudhary
You're right. For that I have to manage the queue and all the
complexities of timeouts. If Storm is not the right place to do this, then
what else?




Re: Syncing multiple streams to compute final result from a bolt

2016-09-20 Thread Ambud Sharma
The correct way is to perform time-window aggregation using bucketing.

Use the timestamp on your event, computed from the various stages, and send it
to a single bolt where the aggregation happens. You only emit from this bolt
once you receive results from both parts.

It's like creating a barrier, or the join phase of a fork-join pool.

That said, the more important question is: is Storm the right place to do
this? When you perform time-window aggregation you are susceptible to tuple
timeouts and also have to make sure your aggregation is idempotent.
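
The barrier/join described above can be sketched as a buffer keyed by the parent message id: nothing is emitted until all expected partial results for that id have arrived. This is a Storm-independent sketch with illustrative names; a real bolt would also need to evict stale buckets to cope with tuple timeouts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the join phase: buffer partial results per parent id and
// release the bucket only once all expected parts have been received.
public class JoinBuffer {
    private final Map<String, List<Integer>> buckets = new HashMap<>();

    // Returns the complete list of partial results once all `expected`
    // parts for `parentId` have arrived; returns null while still filling.
    public List<Integer> add(String parentId, int expected, int partialResult) {
        List<Integer> bucket =
            buckets.computeIfAbsent(parentId, k -> new ArrayList<>());
        bucket.add(partialResult);
        if (bucket.size() == expected) {
            return buckets.remove(parentId); // barrier reached: emit downstream
        }
        return null; // still waiting on the other parts
    }
}
```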



Re: Syncing multiple streams to compute final result from a bolt

2016-09-20 Thread Harsh Choudhary
But how would that solve the syncing problem?




Re: Syncing multiple streams to compute final result from a bolt

2016-09-20 Thread Alberto São Marcos
I would dump the *Bolt-A* results in a shared data store/queue and have a
separate workflow, with another spout and Bolt-B, draining from there.
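
The hand-off suggested above can be sketched with any shared queue; here an in-process `BlockingQueue` stands in for the external store (in practice this would be Kafka, Redis, or similar, so that the two workflows can live in different processes).

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: the first workflow (Bolt-A) publishes partial results into a
// shared queue; a separate workflow (second spout feeding Bolt-B) drains it.
public class SharedHandoff {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Bolt-A side: dump a partial result into the shared store.
    public void publish(String partialResult) {
        queue.offer(partialResult);
    }

    // Second spout side: drain the next partial result, or null if empty.
    public String drain() {
        return queue.poll();
    }
}
```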


Syncing multiple streams to compute final result from a bolt

2016-09-20 Thread Harsh Choudhary
Hi

I am thinking of doing the following.

A spout subscribes to Kafka and gets JSONs. The spout emits the JSONs as
individual tuples.

Bolt-A subscribes to the spout. Bolt-A creates multiple JSONs from each
JSON and emits them as multiple streams.

Bolt-B receives these streams and does the computation on them.

I need to build a cumulative result, in a bolt, from all the multiple JSONs
that emerged from a single JSON. But a bolt's static instance variable is
only shared between tasks per worker. How do I achieve this syncing?

  --->
Spout ---> Bolt-A   --->   Bolt-B  ---> Final result
  --->

The final result is per JSON which was read from Kafka.

Or is there a better way to achieve this?
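
A common answer to the shared-state problem above, hedged here as an editorial note rather than something stated in the thread, is to route all pieces of one JSON to the same Bolt-B/aggregator task with a fields grouping on the parent message id; then an ordinary per-task map suffices and no static or cross-worker variable is needed. The routing invariant a fields grouping provides can be sketched as:

```java
// Sketch of the fields-grouping invariant: parts carrying the same
// parentId always hash to the same task index, so one task's local map
// sees every part of a given JSON. Names are illustrative.
public class FieldsGroupingSketch {
    public static int taskFor(String parentId, int numTasks) {
        // floorMod keeps the result non-negative even for negative hashes
        return Math.floorMod(parentId.hashCode(), numTasks);
    }
}
```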