Is this real-time or batch? If batch, this is perfect for MapReduce or Spark.
If real-time, then you should use Spark or Storm Trident.

On Sep 20, 2016 9:39 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:

> My use case is that I have a JSON which contains an array. I need to split
> that array into multiple JSONs and do some computations on them. After
> that, the results from each JSON have to be used together in a further
> calculation to come up with the final result.
>
> Cheers!
>
> Harsh Choudhary / Software Engineer
> Blog / express.harshti.me
>
> On Tue, Sep 20, 2016 at 9:09 PM, Ambud Sharma <asharma52...@gmail.com> wrote:
>
>> What's your use case?
>>
>> The complexities can be managed as long as your tuple branching is
>> reasonable, i.e. one tuple creates several other tuples and you need to
>> sync results between them.
>>
>> On Sep 20, 2016 8:19 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>>
>>> You're right. For that I have to manage the queue and all those
>>> complexities of timeouts. If Storm is not the right place to do this,
>>> then what else is?
>>>
>>> On Tue, Sep 20, 2016 at 8:25 PM, Ambud Sharma <asharma52...@gmail.com> wrote:
>>>
>>>> The correct way is to perform time-window aggregation using bucketing.
>>>>
>>>> Use the timestamp on your event, computed from the various stages, and
>>>> send it to a single bolt where the aggregation happens. You only emit
>>>> from this bolt once you receive results from both parts.
>>>>
>>>> It's like creating a barrier, or the join phase of a fork-join pool.
>>>>
>>>> That said, the more important question is: is Storm the right place to
>>>> do this? When you perform time-window aggregation you are susceptible
>>>> to tuple timeouts and also have to make sure your aggregation is
>>>> idempotent.
>>>>
>>>> On Sep 20, 2016 7:49 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>>>>
>>>>> But how would that solve the syncing problem?
>>>>>
>>>>> On Tue, Sep 20, 2016 at 8:12 PM, Alberto São Marcos <alberto....@gmail.com> wrote:
>>>>>
>>>>>> I would dump the Bolt-A results in a shared data store/queue and
>>>>>> have a separate workflow, with another spout and Bolt-B draining
>>>>>> from there.
>>>>>>
>>>>>> On Tue, Sep 20, 2016 at 9:20 AM, Harsh Choudhary <shry.ha...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I am thinking of doing the following.
>>>>>>>
>>>>>>> A spout subscribes to Kafka and gets JSONs. The spout emits the
>>>>>>> JSONs as individual tuples.
>>>>>>>
>>>>>>> Bolt-A subscribes to the spout. Bolt-A creates multiple JSONs from
>>>>>>> each JSON and emits them as multiple streams.
>>>>>>>
>>>>>>> Bolt-B receives these streams and does the computation on them.
>>>>>>>
>>>>>>> I need to build a cumulative result, in a bolt, from all the
>>>>>>> multiple JSONs which emerged from a single JSON. But a bolt's
>>>>>>> static instance variable is only shared between tasks per worker.
>>>>>>> How do I achieve this syncing?
>>>>>>>
>>>>>>>           --->
>>>>>>> Spout ---> Bolt-A ---> Bolt-B ---> Final result
>>>>>>>           --->
>>>>>>>
>>>>>>> The final result is per JSON which was read from Kafka.
>>>>>>>
>>>>>>> Or is there any other way to achieve this better?
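The "barrier" aggregation suggested in the thread (bucket partial results by the parent JSON's id, emit only once every part has arrived) can be sketched in plain Java. This is a minimal illustration of the join logic only, not Storm API: the class and method names are hypothetical, the part count is assumed to travel with each split tuple so the aggregator knows when the barrier is complete, and the combine step is a simple sum standing in for the real computation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the barrier/join aggregation (not Storm API).
// In a real topology this logic would live in the single aggregation
// bolt's execute(); tuple timeouts and idempotency still need handling.
public class BarrierAggregator {
    // parentId -> partial results collected so far
    private final Map<String, List<Integer>> buckets = new HashMap<>();
    // parentId -> number of parts Bolt-A split the original JSON into
    private final Map<String, Integer> expected = new HashMap<>();
    private final List<Integer> emitted = new ArrayList<>();

    // Called once per split tuple; partCount is assumed to be attached
    // to every part when the original JSON's array is split.
    public void receive(String parentId, int partCount, int partialResult) {
        expected.putIfAbsent(parentId, partCount);
        buckets.computeIfAbsent(parentId, k -> new ArrayList<>()).add(partialResult);
        List<Integer> parts = buckets.get(parentId);
        if (parts.size() == expected.get(parentId)) {
            // Barrier complete: combine (here, a sum) and emit exactly once.
            int sum = parts.stream().mapToInt(Integer::intValue).sum();
            emitted.add(sum);
            buckets.remove(parentId);
            expected.remove(parentId);
        }
    }

    public List<Integer> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        BarrierAggregator agg = new BarrierAggregator();
        agg.receive("json-1", 3, 10);
        agg.receive("json-1", 3, 20);
        // Nothing emitted yet: only 2 of 3 parts have arrived.
        System.out.println(agg.emitted()); // prints []
        agg.receive("json-1", 3, 12);
        System.out.println(agg.emitted()); // prints [42]
    }
}
```

State held this way lives in one bolt task's memory, which is exactly why the thread warns about tuple timeouts: if a part never arrives, its bucket must eventually be expired, and a replayed tuple must not be counted twice.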