Thanks,

the idea is good, and I will keep it in mind. The only drawback is that it
relies on polling, which I do not like too much in the PredictionBolt. Of
course I could also pass S3 or file references around in the messages to
trigger an update. But for the sake of simplicity I was thinking of keeping
everything in Storm and, if possible, not relying on other systems.
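
To make the reference-passing variant concrete, here is a minimal sketch in
plain Java without any Storm dependency. All names (ModelRef, ModelHolder,
the "models/v1.bin" location) are hypothetical, and the loader is only a
stand-in for whatever would actually read the model from S3 or a file. The
idea is that the training side emits a small message with a location and a
version, and the prediction side reloads only when the version is newer:

```java
import java.util.function.Function;

/** Sketch: send a small model reference through the topology, not the model itself. */
public class ModelRefDemo {

    /** Small message a training bolt could emit (hypothetical type). */
    static final class ModelRef {
        final String location; // e.g. an S3 key or a file path (assumption)
        final long version;    // increases with every finished training run

        ModelRef(String location, long version) {
            this.location = location;
            this.version = version;
        }
    }

    /** State a prediction bolt could keep; reloads only for newer versions. */
    static final class ModelHolder<M> {
        private final Function<String, M> loader; // stand-in for the real S3/file read
        private M model;
        private long version = -1;

        ModelHolder(Function<String, M> loader) {
            this.loader = loader;
        }

        /** Returns true if the reference was newer and the model was reloaded. */
        synchronized boolean onModelRef(ModelRef ref) {
            if (ref.version <= version) {
                return false; // stale or duplicate reference, keep the current model
            }
            model = loader.apply(ref.location);
            version = ref.version;
            return true;
        }

        /** Returns the currently loaded model, or null if none was loaded yet. */
        synchronized M current() {
            return model;
        }
    }

    public static void main(String[] args) {
        // The loader here just builds a string; a real one would fetch and deserialize.
        ModelHolder<String> holder = new ModelHolder<>(loc -> "model loaded from " + loc);
        System.out.println(holder.onModelRef(new ModelRef("models/v1.bin", 1)));
        System.out.println(holder.onModelRef(new ModelRef("models/v1.bin", 1)));
        System.out.println(holder.current());
    }
}
```

With this, only a few bytes travel through the topology per update, but it
does bring in an external store, which is exactly the dependency I would
like to avoid if I can.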

Cheers,

Klaus


On Wed, Feb 26, 2014 at 12:22 PM, Enno Shioji <[email protected]> wrote:

> I can't comment on how large tuples fare, but about the synchronization,
> would this not make more sense?
>
> InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
>                      |                  |
>                     \/                  |
>                 Agg. State              |
>                     /\                  |
>                      |                  V
>                TrainingBolt -----> Model State
>
> I.e. AggregationBolt writes to AggregationState, which is polled by
> TrainingBolt, which writes to ModelState. ModelState is then polled by
> PredictionBolt.
>
> This way, you can get rid of the large tuples as well and use
> something like S3 for these large states instead.
>
>
>
>
>
> On Wed, Feb 26, 2014 at 11:02 AM, Klausen Schaefersinho <
> [email protected]> wrote:
>
>> Hi,
>>
>> I have a topology which processes events, aggregates them in some form,
>> and performs some prediction based on a machine learning (ML) model. Every
>> x events, one of the bolts involved in the normal processing emits a
>> "trainModel" event, which is routed to a bolt dedicated to
>> the training. Once the training is done, the new model should be sent back
>> to the prediction bolt. The topology looks like:
>>
>>
>>  InputSpout -> AggregationBolt -> PredictionBolt -> OutputBolt
>>                       |                 /\
>>                      \/                  |
>>                 TrainingBolt ------------+
>>
>>
>> The model can get quite large (> 100 MB), so I am not sure how this would
>> impact the performance of my cluster. Does anybody have experience with
>> transmitting large messages?
>>
>> Also, the training might take a while, so the aggregation bolt should not
>> trigger the training bolt if it is busy. Is there an established pattern
>> for achieving this kind of synchronization? I could have some streams to
>> send states, but then I would mix data streams with control streams, which I
>> would really like to avoid. An alternative would be to use ZooKeeper and
>> perform the synchronization there. Last but not least, I could also make
>> the aggregation bolt write into a database and have the training bolt
>> periodically wake up and read the database. Does anybody have experience
>> with such a setup?
>>
>> Kind Regards,
>>
>> Klaus
>>
>>
>
