Hi Klaus,

I've been dealing with similar use cases.  I do a couple of things (which
may not be a final solution, but it is interesting to discuss alternate
approaches): I have passed trained models in the 200MB range through Storm,
but I try to avoid it. The model gets dropped into persistence and then
only an ID for the model is passed through the topology.  So my training
bolt passes the whole model blob to the persistence bolt and that's it...
in the future I may even remove that step so that the model blob never gets
transferred by Storm.  Also, I use separate topologies for training, and
those tend to have much higher timeouts because the "train" aggregator can
take quite a while.  Traditionally this would probably happen in Hadoop or
some other batch system, but I'm too busy to do the setup and Storm is
handling it fine anyway.
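To make the pattern concrete, here is a minimal sketch of the "persist the
blob, pass only the ID" idea. The in-memory map stands in for whatever
persistence you actually use (S3, HBase, a database); the class and method
names are just illustrative, not Storm API:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the external model store (S3, HBase, etc.).
public class ModelStore {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<String, byte[]>();

    // Training side: persist the model blob once and return its ID.
    // Only this small ID travels through the topology afterwards.
    public String put(byte[] modelBlob) {
        String id = UUID.randomUUID().toString();
        blobs.put(id, modelBlob);
        return id;
    }

    // Scoring side: fetch the blob by the ID carried in the tuple.
    public byte[] get(String id) {
        return blobs.get(id);
    }
}
```

The training bolt would call put() and emit only the returned ID downstream,
so the 100MB+ blob crosses the wire to the store once instead of riding
through every bolt.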

I don't have to do any polling because I have model selection running as a
logically different step: a tuple shows up for prediction, a selection
step finds the model ID for scoring that tuple, and then it flows on to an
actual scoring bolt which retrieves the model based on the ID and applies
it to the tuple.  If the creation of a new model leads you to re-score
"old" tuples, you could use the model write to trigger those tuples to be
replayed from some source of state, such that they will pick up the new
model ID and proceed as normal.
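The selection-then-scoring split above can be sketched like this. Again,
the names are hypothetical and the maps stand in for real state; the key
point is that the scoring side caches deserialized models by ID, so a large
blob is loaded at most once per worker rather than once per tuple:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Selection step resolves the current model ID for a tuple's key; the
// scoring step loads that model on first use and caches it by ID.
public class ModelSelector {

    // Stand-in for however the blob is fetched from persistence and deserialized.
    public interface ModelLoader {
        Object load(String modelId);
    }

    // Current model ID per logical key (e.g. per customer segment).
    private final Map<String, String> currentModel = new ConcurrentHashMap<String, String>();
    // Models already fetched and deserialized on this worker.
    private final Map<String, Object> loadedModels = new ConcurrentHashMap<String, Object>();

    // Training side: publish a freshly persisted model's ID for a key.
    // Replayed "old" tuples pick this up on their next selection step.
    public void publish(String key, String modelId) {
        currentModel.put(key, modelId);
    }

    // Selection step: which model ID should score this tuple?
    public String select(String key) {
        return currentModel.get(key);
    }

    // Scoring step: load-once-then-cache. The race between get and put is
    // benign: at worst the same model is loaded twice.
    public Object getModel(String modelId, ModelLoader loader) {
        Object model = loadedModels.get(modelId);
        if (model == null) {
            model = loader.load(modelId);
            loadedModels.put(modelId, model);
        }
        return model;
    }
}
```

When a new model is written, publish() just swaps the ID; in-flight tuples
that already selected the old ID finish against it, and everything after
that picks up the new one.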

Best,

Adam




On Wed, Feb 26, 2014 at 7:54 AM, Klausen Schaefersinho <
[email protected]> wrote:

> Thanks,
>
> the idea is good, I will keep that in mind. The only drawback is that it
> relies on polling, which I do not like too much in the PredictionBolt. Of
> course I could also pass S3 or file references around in the messages to
> trigger an update. But for the sake of simplicity I was thinking of keeping
> everything in Storm and not relying on other systems if possible.
>
> Cheers,
>
> Klaus
>
>
> On Wed, Feb 26, 2014 at 12:22 PM, Enno Shioji <[email protected]> wrote:
>
>> I can't comment on how large tuples fare, but about the synchronization,
>> would this not make more sense?
>>
>> InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
>>                     |                  ^
>>                     v                  |
>>                 Agg. State         Model State
>>                     |                  ^
>>                     v                  |
>>                 TrainingBolt ----------+
>>
>> I.e. AggregationBolt writes to AggregationState, which is polled by
>> TrainingBolt, which writes to ModelState. ModelState is then polled by
>> PredictionBolt.
>>
>> This way, you can get rid of the large tuples as well, and instead use
>> something like S3 for these large states.
>>
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 11:02 AM, Klausen Schaefersinho <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a topology which processes events, aggregates them in some form,
>>> and performs some prediction based on a machine learning (ML) model. Every
>>> x events, one of the bolts involved in the normal processing emits a
>>> "trainModel" event, which is routed to a bolt dedicated to training. Once
>>> the training is done, the new model should be sent back to the prediction
>>> bolt. The topology looks like:
>>>
>>>
>>>  InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
>>>                      |                  ^
>>>                      v                  |
>>>                 TrainingBolt ----------+
>>>
>>>
>>> The model can get quite large (> 100 MB), so I am not sure how this would
>>> impact the performance of my cluster.  Does anybody have experience with
>>> transmitting large messages?
>>>
>>> Also, the training might take a while, so the aggregation bolt should not
>>> trigger the training bolt while it is busy. Is there an established pattern
>>> for achieving this kind of synchronization? I could have some streams to
>>> send states, but then I would mix the data stream with a control stream,
>>> which I would really like to avoid. An alternative would be to use
>>> ZooKeeper and perform the synchronization there. Last but not least, I
>>> could also have the aggregation bolt write into a database and have the
>>> training bolt periodically wake up and read the database. Does anybody
>>> have experience with such a setup?
>>>
>>> Kind Regards,
>>>
>>> Klaus
>>>
>>>
>>
>
