THX Adam,

Having a separate topology for training is actually a very nice idea. I
will give it a try!

Cheers,

Klaus


On Wed, Feb 26, 2014 at 3:28 PM, Adam Lewis <[email protected]> wrote:

> Hi Klaus,
>
> I've been dealing with similar use cases.  I do a couple of things (which
> may not be a final solution, but it is interesting to discuss alternate
> approaches): I have passed trained models in the 200MB range through Storm,
> but I try to avoid it. The model gets dropped into persistence and then
> only the ID of the model is passed through the topology.  So my training
> bolt passes the whole model blob to the persistence bolt and that's it...
> in the future I may even remove that step so that the model blob never gets
> transferred by Storm.  Also, I use separate topologies for training, and
> those tend to have much higher timeouts because the "train" aggregator can
> take quite a while.  Traditionally this would probably happen in Hadoop or
> some other batch system, but I'm too busy to do the setup and Storm is
> handling it fine anyway.
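>
> Roughly, the persistence step looks like this (just a sketch, not our
> exact code; the ModelStore class and the field names are made up for
> illustration):
>
> import backtype.storm.topology.BasicOutputCollector;
> import backtype.storm.topology.OutputFieldsDeclarer;
> import backtype.storm.topology.base.BaseBasicBolt;
> import backtype.storm.tuple.Fields;
> import backtype.storm.tuple.Tuple;
> import backtype.storm.tuple.Values;
>
> public class PersistenceBolt extends BaseBasicBolt {
>     @Override
>     public void execute(Tuple input, BasicOutputCollector collector) {
>         // Receive the whole model blob from the training bolt...
>         byte[] modelBlob = input.getBinaryByField("model");
>         // ...store it, and pass only the ID downstream, so the large
>         // blob travels through Storm exactly once.
>         String modelId = ModelStore.save(modelBlob); // hypothetical store
>         collector.emit(new Values(modelId));
>     }
>
>     @Override
>     public void declareOutputFields(OutputFieldsDeclarer declarer) {
>         declarer.declare(new Fields("modelId"));
>     }
> }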
>
> I don't have to do any polling because I have model selection running as a
> logically different step, i.e. a tuple shows up for prediction, a
> selection step finds the model ID for scoring that tuple, then it
> flows on to an actual scoring bolt which retrieves the model based on the
> ID and applies it to the tuple.  If the creation of a new model leads you
> to re-score "old" tuples, you could use the model write to trigger those
> tuples to be replayed from some source of state such that they will pick up
> the new model ID and proceed as normal.
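>
> The scoring side, sketched in the same spirit (Model and ModelStore are
> again placeholders, and the backtype.storm imports are omitted):
>
> public class ScoringBolt extends BaseBasicBolt {
>     @Override
>     public void execute(Tuple input, BasicOutputCollector collector) {
>         // The selection step upstream has already attached the model ID.
>         String modelId = input.getStringByField("modelId");
>         // Fetch the model by ID (a local cache helps a lot here).
>         Model model = ModelStore.load(modelId); // hypothetical store
>         double score = model.score(input.getValueByField("features"));
>         collector.emit(new Values(score));
>     }
>
>     @Override
>     public void declareOutputFields(OutputFieldsDeclarer declarer) {
>         declarer.declare(new Fields("score"));
>     }
> }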
>
> Best,
>
> Adam
>
>
>
>
> On Wed, Feb 26, 2014 at 7:54 AM, Klausen Schaefersinho <
> [email protected]> wrote:
>
>> THX,
>>
>> the idea is good, I will keep that in mind. The only drawback is that it
>> relies on polling, which I do not like too much in the PredictionBolt. Of
>> course I could also pass S3 or file references around in the messages to
>> trigger an update. But for the sake of simplicity I was thinking of keeping
>> everything in Storm and not relying on other systems if possible.
>>
>> Cheers,
>>
>> Klaus
>>
>>
>> On Wed, Feb 26, 2014 at 12:22 PM, Enno Shioji <[email protected]> wrote:
>>
>>> I can't comment on how large tuples fare, but about the synchronization,
>>> would this not make more sense?
>>>
>>> InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
>>>                     |                  ^
>>>                     v                  |
>>>                 Agg. State        Model State
>>>                     |                  ^
>>>                     v                  |
>>>                 TrainingBolt ----------+
>>>
>>> I.e. AggregationBolt writes to AggregationState, which is polled by
>>> TrainingBolt, which writes to ModelState. ModelState is then polled by
>>> PredictionBolt.
>>>
>>> This way, you can get rid of the large tuples as well and instead use
>>> something like S3 for these large states.
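>>>
>>> As a rough sketch (Model and ModelState.loadLatest() are placeholders
>>> for whatever S3/database read you end up with, and the backtype.storm
>>> imports are omitted), the polling side of the PredictionBolt could
>>> look like:
>>>
>>> public class PredictionBolt extends BaseBasicBolt {
>>>     private transient Model model;
>>>     private transient long lastPollMs;
>>>
>>>     @Override
>>>     public void execute(Tuple input, BasicOutputCollector collector) {
>>>         // Re-read the model state at most once a minute.
>>>         long now = System.currentTimeMillis();
>>>         if (model == null || now - lastPollMs > 60000L) {
>>>             model = ModelState.loadLatest(); // hypothetical state store
>>>             lastPollMs = now;
>>>         }
>>>         Object features = input.getValueByField("features");
>>>         collector.emit(new Values(model.score(features)));
>>>     }
>>>
>>>     @Override
>>>     public void declareOutputFields(OutputFieldsDeclarer declarer) {
>>>         declarer.declare(new Fields("prediction"));
>>>     }
>>> }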
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 26, 2014 at 11:02 AM, Klausen Schaefersinho <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a topology which processes events, aggregates them in some form,
>>>> and performs some prediction based on a machine learning (ML) model. Every
>>>> x events, one of the bolts involved in the normal processing emits a
>>>> "trainModel" event, which is routed to a bolt dedicated to
>>>> the training. Once the training is done, the new model should be sent back
>>>> to the prediction bolt. The topology looks like:
>>>>
>>>>
>>>>  InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
>>>>                      |                  ^
>>>>                      v                  |
>>>>                  TrainingBolt ----------+
>>>>
>>>>
>>>> The model can get quite large (> 100 MB), so I am not sure how this
>>>> would impact the performance of my cluster.  Does anybody have experience
>>>> with transmitting large messages?
>>>>
>>>> Also, the training might take a while, so the aggregation bolt should
>>>> not trigger the training bolt if it is busy. Is there an established
>>>> pattern for achieving this kind of synchronization? I could have some
>>>> streams to send states, but then I would mix the data stream with a
>>>> control stream, which I would really like to avoid. An alternative would
>>>> be to use ZooKeeper and perform the synchronization there. Last but not
>>>> least, I could also have the aggregation bolt write into a database and
>>>> have the training bolt periodically wake up and read the database. Does
>>>> anybody have experience with such a setup?
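>>>>
>>>> For context, the every-x-events trigger currently looks roughly like
>>>> this (a simplified sketch; the count, the stream name and the
>>>> aggregate() helper are made up, and the backtype.storm imports are
>>>> omitted):
>>>>
>>>> public class AggregationBolt extends BaseBasicBolt {
>>>>     private transient long count;
>>>>
>>>>     @Override
>>>>     public void execute(Tuple input, BasicOutputCollector collector) {
>>>>         // Normal processing: emit the aggregate on the default stream.
>>>>         collector.emit(new Values(aggregate(input)));
>>>>         // Every x events, emit a trigger on a separate "train" stream.
>>>>         if (++count % 10000 == 0) {
>>>>             collector.emit("train", new Values(count));
>>>>         }
>>>>     }
>>>>
>>>>     @Override
>>>>     public void declareOutputFields(OutputFieldsDeclarer declarer) {
>>>>         declarer.declare(new Fields("aggregate"));
>>>>         declarer.declareStream("train", new Fields("count"));
>>>>     }
>>>>
>>>>     private Object aggregate(Tuple input) {
>>>>         // ... the actual aggregation would go here ...
>>>>         return input.getValue(0);
>>>>     }
>>>> }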
>>>>
>>>> Kind Regards,
>>>>
>>>> Klaus
>>>>
>>>>
>>>
>>
>
