Thanks, the idea is good; I will keep it in mind. The only drawback is that it relies on polling, which I do not like too much in the PredictionBolt. Of course, I could also pass S3 or file references around in the messages to trigger an update. But for the sake of simplicity I was thinking of keeping everything in Storm and, if possible, not relying on other systems.
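
To make the reference-passing variant concrete, here is a minimal sketch in plain Java. It does not use the Storm API; `ModelStore`, `Predictor`, `onModelUpdate` and the `models/v1` key are all hypothetical names made up for illustration, and the in-memory map only stands in for S3 or a shared file system. The point is that only a small reference travels in the message, and the prediction side swaps in the large model when that message arrives instead of polling.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Small demo driver: publish a model, send only its reference, then predict.
class ModelRefDemo {
    public static void main(String[] args) {
        ModelStore store = new ModelStore();
        Predictor predictor = new Predictor(store);

        // The training side stores the large model and emits only the key.
        store.put("models/v1", new double[] {1.0, 2.0});
        predictor.onModelUpdate("models/v1");

        System.out.println(predictor.predict(new double[] {2.0, 3.0}));
    }
}

// Hypothetical stand-in for external storage such as S3 or a shared file system.
class ModelStore {
    private final Map<String, double[]> blobs = new ConcurrentHashMap<>();

    void put(String ref, double[] weights) {
        blobs.put(ref, weights.clone());
    }

    double[] get(String ref) {
        return blobs.get(ref);
    }
}

// Hypothetical prediction component; in a topology this logic would live
// inside the prediction bolt and be driven by incoming tuples.
class Predictor {
    private final ModelStore store;
    private final AtomicReference<double[]> model =
            new AtomicReference<>(new double[] {0.0});
    private volatile String currentRef = "none";

    Predictor(ModelStore store) {
        this.store = store;
    }

    // Called when a small "model updated" message arrives.
    // Unknown references are ignored, so the old model stays active.
    void onModelUpdate(String ref) {
        double[] loaded = store.get(ref);
        if (loaded != null) {
            model.set(loaded);
            currentRef = ref;
        }
    }

    // Toy prediction (a dot product) standing in for the real ML model.
    double predict(double[] features) {
        double[] weights = model.get();
        double sum = 0.0;
        int n = Math.min(weights.length, features.length);
        for (int i = 0; i < n; i++) {
            sum += weights[i] * features[i];
        }
        return sum;
    }

    String currentRef() {
        return currentRef;
    }
}
```

The atomic swap means predictions keep using the old model until the new one is fully loaded. The price is the one I mentioned: the store is still a system outside of Storm.
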

Cheers,
Klaus

On Wed, Feb 26, 2014 at 12:22 PM, Enno Shioji <[email protected]> wrote:

> I can't comment on how large tuples fare, but about the synchronization,
> would this not make more sense?
>
> InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
>                     |                  |
>                     \/                 |
>                 Agg. State             |
>                     /\                 |
>                     |                  V
>               TrainingBolt -----> Model State
>
> I.e. AggregationBolt writes to AggregationState, which is polled by
> TrainingBolt, which writes to ModelState. ModelState is then polled by
> PredictionBolt.
>
> This way, you can get rid of the large tuples as well and use
> something like S3 for these large states instead.
>
> On Wed, Feb 26, 2014 at 11:02 AM, Klausen Schaefersinho <
> [email protected]> wrote:
>
>> Hi,
>>
>> I have a topology which processes events, aggregates them in some form,
>> and performs some prediction based on a machine learning (ML) model. Every
>> x events, one of the bolts involved in the normal processing emits a
>> "trainModel" event, which is routed to a bolt that is dedicated to
>> the training. Once the training is done, the new model should be sent back
>> to the prediction bolt. The topology looks like:
>>
>> InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
>>                     |                  /\
>>                     \/                 |
>>               TrainingBolt ------------+
>>
>> The model can get quite large (> 100 MB), so I am not sure how this would
>> impact the performance of my cluster. Does anybody have experience with
>> transmitting large messages?
>>
>> Also, the training might take a while, so the aggregation bolt should not
>> trigger the training bolt if it is busy. Is there an established pattern
>> for achieving this kind of synchronization? I could have some streams to
>> send states, but then I would mix the data stream with a control stream, which I
>> really would like to avoid. An alternative would be to use ZooKeeper and
>> perform the synchronization there. Last but not least, I could also make
>> the aggregation bolt write into a database and have the training bolt
>> periodically wake up and read the database. Does anybody have experience
>> with such a setup?
>>
>> Kind Regards,
>>
>> Klaus
