Storm looks like a feasible fit for your problem. You could design the
topology so that one set of bolts does keyword extraction, another set
does language detection, and so on. You can also apply your Mahout
clustering and classification algorithms to the streams of data processed
by the bolts.
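
To make the "one bolt per step" idea concrete, here is a minimal sketch in
plain Python. It does not use the Storm API: the names document_spout,
keyword_bolt and language_bolt are hypothetical stand-ins, and the keyword
and language logic are toy placeholders for whatever you run today. It only
shows how each step becomes an independent stage that consumes tuples and
emits enriched tuples.

```python
from collections import Counter


def document_spout(docs):
    """Stand-in for a spout: emits one tuple (a dict here) per document."""
    for doc_id, text in docs:
        yield {"id": doc_id, "text": text}


def keyword_bolt(stream, top_n=3):
    """Stand-in for a keyword-extraction bolt (toy: most frequent long words)."""
    for tup in stream:
        words = [w.lower().strip(".,!?") for w in tup["text"].split()]
        counts = Counter(w for w in words if len(w) > 3)
        yield {**tup, "keywords": [w for w, _ in counts.most_common(top_n)]}


def language_bolt(stream):
    """Stand-in for a language-detection bolt (toy: stopword overlap)."""
    stopwords = {"en": {"the", "and", "is"}, "es": {"el", "la", "y", "es"}}
    for tup in stream:
        words = set(tup["text"].lower().split())
        lang = max(stopwords, key=lambda code: len(words & stopwords[code]))
        yield {**tup, "lang": lang}


def run_pipeline(docs):
    """Chain the stages the way a topology would wire spout -> bolt -> bolt."""
    return list(language_bolt(keyword_bolt(document_spout(docs))))
```

In a real topology each stage would be a bolt class with its own
parallelism, so the keyword and language steps could be scaled
independently of each other.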

My only concern is the data source. If your data comes from something
like Kafka, that would be great. I don't think spouts reading data from
files would be the best fit.

Regards,
Padma Ch

On Tue, Oct 7, 2014 at 3:56 PM, Albert Vila <[email protected]> wrote:

> Hi
>
> I just came across Storm when I was trying to find solutions to scale our
> current architecture.
>
> We are currently downloading and processing 6M documents per day from
> online and social media. We have a different workflow for each type of
> document, but some of the steps are keyword extraction, language detection,
> clustering, classification, indexation, .... We are using Gearman to
> dispatch the job to workers.
>
> I'm wondering if we could integrate Storm into the current workflow and if
> it's feasible. One of our main discussions is whether we should go to a fully
> distributed architecture or to a semi-distributed one. I mean, distribute
> everything or process some steps on the same machine (crawling, keyword
> extraction, language detection, indexation). We don't know which one scales
> better; each one has pros and cons.
>
> Right now we have a semi-distributed one, because we had network problems
> given the amount of data we were moving around. So now, all documents
> crawled on server X are later dispatched through Gearman to the same
> server, with all the data held in a local Memcached.
>
> What do you think?
> Is it feasible to migrate to a Storm cluster?
> Should we take into account the traffic among the Storm cluster?
> Is there a way to isolate some bolts to be processed on the same machine
> grouped by some field?
>
> Any help or comment will be appreciated. And if someone has had a similar
> problem and has knowledge about the architecture approach, that would be
> more than welcome.
>
> Thanks
>
> Albert
>
