Hi,

I have a topology which processes events, aggregates them in some form, and
performs predictions based on a machine learning (ML) model. Every x
events, one of the bolts involved in the normal processing emits a
"trainModel" event, which is routed to a bolt dedicated solely to
training. Once the training is done, the new model should be sent back
to the prediction bolt. The topology looks like:


InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
                     |                  /\
                     \/                 |
                TrainingBolt -----------+


The model can get quite large (> 100 MB), so I am not sure how this would
impact the performance of my cluster. Does anybody have experience with
transmitting large messages?

Also, the training might take a while, so the aggregation bolt should not
trigger the training bolt while it is busy. Is there an established pattern
for achieving this kind of synchronization? I could have some streams to
send states, but then I would mix the data stream with a control stream, which
I would really like to avoid. An alternative would be to use ZooKeeper and
perform the synchronization there. Last but not least, I could also make
the aggregation bolt write into a database and have the training bolt
periodically wake up and read the database. Does anybody have experience
with such a setup?
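
To make the requirement concrete, here is a minimal sketch of the "do not
trigger while busy" behaviour I am after. It is plain Java and only works
inside a single JVM; the class name TrainingGate and its methods are
hypothetical and not part of Storm. In a real cluster the flag would have to
live somewhere shared (for example a ZooKeeper node or a database row), which
is exactly the part I am unsure about.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical gate that lets only one training run be in flight at a time.
 * Single-JVM illustration only; not a Storm API.
 */
public class TrainingGate {
    // true while a training run is in progress
    private final AtomicBoolean busy = new AtomicBoolean(false);

    /** Atomically marks the trainer busy; returns false if it already was. */
    public boolean tryAcquire() {
        return busy.compareAndSet(false, true);
    }

    /** Called once training is done and the new model has been handed over. */
    public void release() {
        busy.set(false);
    }

    public boolean isBusy() {
        return busy.get();
    }

    public static void main(String[] args) {
        TrainingGate gate = new TrainingGate();
        int eventsPerTraining = 3; // the "x" from the description (assumed value)
        int triggered = 0;
        int skipped = 0;

        for (int event = 1; event <= 9; event++) {
            if (event % eventsPerTraining == 0) {
                // The aggregation side only emits "trainModel" if the gate is free.
                if (gate.tryAcquire()) {
                    triggered++;
                } else {
                    skipped++;
                }
            }
            if (event == 7) {
                gate.release(); // pretend the training run finishes here
            }
        }
        System.out.println("triggered=" + triggered + " skipped=" + skipped);
    }
}
```

The aggregation side would call tryAcquire() before emitting "trainModel",
and the training side would call release() after the new model has been
delivered. Skipped triggers are simply dropped in this sketch; whether they
should instead be queued is an open design question.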

Kind Regards,

Klaus
