Hi

I just came across Storm while looking for ways to scale our current
architecture.

We are currently downloading and processing 6M documents per day from
online and social media sources. We have a different workflow for each type
of document, but some of the steps are keyword extraction, language
detection, clustering, classification, indexing, and so on. We use Gearman
to dispatch the jobs to workers.
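
To make the question concrete, here is a rough sketch of how I picture one
of our workflows as a topology. The spout and bolt classes (CrawlSpout,
LanguageBolt, ...) are just placeholders for our existing workers, and I'm
assuming the backtype.storm packages from Storm 0.8/0.9:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;

public class DocPipelineTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // Placeholder spout/bolts standing in for our current Gearman workers.
        builder.setSpout("crawl", new CrawlSpout(), 4);      // emits raw documents
        builder.setBolt("language", new LanguageBolt(), 8)
               .shuffleGrouping("crawl");                    // language detection
        builder.setBolt("keywords", new KeywordBolt(), 8)
               .shuffleGrouping("language");                 // keyword extraction
        builder.setBolt("classify", new ClassifyBolt(), 8)
               .shuffleGrouping("keywords");                 // clustering / classification
        builder.setBolt("index", new IndexBolt(), 4)
               .shuffleGrouping("classify");                 // indexing
        new LocalCluster().submitTopology("doc-pipeline", new Config(),
                                          builder.createTopology());
    }
}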

I'm wondering whether it's feasible to integrate Storm into the current
workflow. One of our main debates is whether to go to a fully distributed
architecture or a semi-distributed one: that is, distribute everything, or
process some steps on the same machine (crawling, keyword extraction,
language detection, indexing). We don't know which one scales better; each
has its pros and cons.
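
If I read the docs correctly, this choice may not even be all-or-nothing:
it looks like it can be made per edge through the stream grouping.
shuffleGrouping spreads tuples across the whole cluster, while
localOrShuffleGrouping prefers tasks in the same worker process, avoiding
the network hop. Continuing the placeholder builder from the sketch above:

// Small tuples: spread freely across the cluster.
builder.setBolt("language", new LanguageBolt(), 8)
       .shuffleGrouping("crawl");

// Large document bodies: prefer a "keywords" task in the same worker
// process, so the payload only crosses the network when no local task
// is available.
builder.setBolt("keywords", new KeywordBolt(), 8)
       .localOrShuffleGrouping("language");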

Right now we have a semi-distributed setup, because we ran into network
problems given the amount of data we were moving around. So now, all
documents crawled on server X are later dispatched through Gearman to that
same server, with all the data kept in a local Memcached.

What do you think?
Is it feasible to migrate to a Storm cluster?
Should we take the traffic within the Storm cluster into account?
Is there a way to pin some bolts, grouped by some field, so they are
processed on the same machine?
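
On that last one: I've seen fieldsGrouping, which (if I understand it)
routes every tuple with the same value of a field to the same bolt task,
so for instance all documents from one crawl server would land in the same
JVM. Is that the right tool, or does it only guarantee the same task, not
the same machine? A sketch, assuming our tuples carried a made-up "server"
field:

import backtype.storm.tuple.Fields;

// All tuples with the same "server" value go to the same "keywords" task.
builder.setBolt("keywords", new KeywordBolt(), 8)
       .fieldsGrouping("language", new Fields("server"));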

Any help or comments would be appreciated. And if someone has faced a
similar problem and can share their architectural approach, that would be
more than welcome.

Thanks

Albert
