Hi Klaus,

I've been dealing with similar use cases. I do a couple of things (not necessarily a final solution, but worth discussing as an alternate approach).

I have passed trained models in the 200 MB range through Storm, but I try to avoid it. The model gets dropped into persistence and then only the ID of the model is passed through the topology. So my training bolt passes the whole model blob to the persistence bolt and that's it; in the future I may even remove that step so that the model blob never gets transferred by Storm at all.

Also, I use separate topologies for training, and those tend to have much higher timeouts because the "train" aggregator can take quite a while. Traditionally this would probably happen in Hadoop or some other batch system, but I'm too busy to do the setup and Storm is handling it fine anyway.
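To make the ID-passing concrete, here is a minimal sketch of the pattern in plain Python rather than actual Storm code; the ModelStore class and the content-addressed IDs are just my illustration, not anything Storm or a particular store provides:

```python
import hashlib

class ModelStore:
    """Stand-in for external persistence (e.g. S3 or HBase); not a Storm API."""
    def __init__(self):
        self._blobs = {}

    def put(self, blob: bytes) -> str:
        # Content-addressed ID: the same model blob always maps to the same key.
        model_id = hashlib.sha256(blob).hexdigest()
        self._blobs[model_id] = blob
        return model_id

    def get(self, model_id: str) -> bytes:
        return self._blobs[model_id]

store = ModelStore()

# Training-bolt side: persist the large blob once, emit only the small ID tuple.
model_blob = b"...imagine ~200 MB of serialized model here..."
model_id = store.put(model_blob)

# Prediction-bolt side: resolve the ID back to the blob (and cache it locally).
assert store.get(model_id) == model_blob
```

The point is that the tuple flowing through the topology carries only the short hex ID, never the blob itself.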
I don't have to do any polling because model selection runs as a logically separate step: a tuple shows up for prediction, a selection step finds the model ID for scoring that tuple, and the tuple then flows on to an actual scoring bolt, which retrieves the model by ID and applies it. If the creation of a new model leads you to re-score "old" tuples, you could use the model write to trigger those tuples to be replayed from some source of state, so that they pick up the new model ID and proceed as normal.

Best,

Adam

On Wed, Feb 26, 2014 at 7:54 AM, Klausen Schaefersinho <[email protected]> wrote:

> THX,
>
> the idea is good, I will keep that in mind. The only drawback is that it
> relies on polling, which I don't like too much in the PredictionBolt. Of
> course I could also pass S3 or file references around in the messages to
> trigger an update. But for the sake of simplicity I was thinking of keeping
> everything in Storm and not relying on other systems if possible.
>
> Cheers,
>
> Klaus
>
>
> On Wed, Feb 26, 2014 at 12:22 PM, Enno Shioji <[email protected]> wrote:
>
>> I can't comment on how large tuples fare, but about the synchronization,
>> would this not make more sense?
>>
>>     InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
>>                         |                  |
>>                         \/                 |
>>                     Agg. State             |
>>                         /\                 |
>>                         |                  V
>>                    TrainingBolt -----> Model State
>>
>> I.e. AggregationBolt writes to AggregationState, which is polled by
>> TrainingBolt, which writes to ModelState. ModelState is then polled by
>> PredictionBolt.
>>
>> This way, you can get rid of the large tuples as well and instead use
>> something like S3 for these large states.
>>
>>
>> On Wed, Feb 26, 2014 at 11:02 AM, Klausen Schaefersinho <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a topology which processes events, aggregates them in some form,
>>> and performs some prediction based on a machine learning (ML) model.
>>> Every x events, one of the bolts involved in the normal processing emits a
>>> "trainModel" event, which is routed to a bolt dedicated to the training.
>>> Once the training is done, the new model should be sent back to the
>>> prediction bolt. The topology looks like:
>>>
>>>     InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
>>>                         |                  /\
>>>                         \/                 |
>>>                    TrainingBolt -----------+
>>>
>>> The model can get quite large (> 100 MB), so I am not sure how this would
>>> impact the performance of my cluster. Does anybody have experience with
>>> transmitting large messages?
>>>
>>> Also, the training might take a while, so the aggregation bolt should not
>>> trigger the training bolt while it is busy. Is there an established pattern
>>> for achieving this kind of synchronization? I could have some streams to
>>> send states, but then I would mix the data stream with a control stream,
>>> which I would really like to avoid. An alternative would be to use
>>> ZooKeeper and perform the synchronization there. Last but not least, I
>>> could also have the aggregation bolt write into a database and have the
>>> training bolt periodically wake up and read the database. Does anybody
>>> have experience with such a setup?
>>>
>>> Kind Regards,
>>>
>>> Klaus
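For what it's worth, the selection-then-scoring flow from my reply above can be sketched in the same spirit, again in plain Python rather than Storm code; ModelState, publish() and the callable models are hypothetical names for illustration only:

```python
class ModelState:
    """Stand-in for shared model storage plus a 'current model' pointer
    (e.g. blobs in S3 plus a small key in ZooKeeper); not a Storm API."""
    def __init__(self):
        self.models = {}        # model_id -> model (here: a plain callable)
        self.current_id = None  # which model new tuples should be scored with

    def publish(self, model_id, model):
        self.models[model_id] = model
        self.current_id = model_id  # would be an atomic pointer swap for real

def select_model(state, tup):
    # Selection step: decide which model ID scores this tuple.
    # Here it is simply "the latest"; real logic could key off tuple fields.
    return state.current_id

def score(state, tup, model_id):
    # Scoring bolt: retrieve the model by ID and apply it to the tuple.
    return state.models[model_id](tup)

state = ModelState()
state.publish("model-v1", lambda x: x * 2)

tup = 21
model_id = select_model(state, tup)
print(score(state, tup, model_id))  # prints 42

# When training publishes a new model, later tuples pick it up with no polling.
state.publish("model-v2", lambda x: x + 1)
print(score(state, tup, select_model(state, tup)))  # prints 22
```

Because every tuple runs the selection step on its way to scoring, a newly published model takes effect on the next tuple without the prediction side ever polling.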
