Hi Albert,

You can use "local or shuffle grouping".
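A minimal sketch of the wiring, assuming Storm 0.9.x's backtype.storm
packages; CrawlerSpout, KeywordExtractionBolt and LanguageDetectionBolt are
hypothetical stand-ins for your own components:

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    // Inside whatever main() builds and submits your topology.
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("crawler", new CrawlerSpout(), 4);

    // local-or-shuffle: prefer a consumer task in the same worker process
    // when one is available, so tuples skip the network; otherwise fall
    // back to a plain shuffle across the cluster.
    builder.setBolt("keywords", new KeywordExtractionBolt(), 4)
           .localOrShuffleGrouping("crawler");

    // fields grouping: all tuples with the same "docId" value go to the
    // same bolt task, i.e. the "grouped by some field" routing asked
    // about below.
    builder.setBolt("language", new LanguageDetectionBolt(), 4)
           .fieldsGrouping("crawler", new Fields("docId"));

Note that fieldsGrouping guarantees the same task for a given field value,
not the same physical machine, while localOrShuffleGrouping keeps a tuple
inside the current worker process whenever a consumer task runs there. The
full list of groupings is here: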
https://storm.incubator.apache.org/documentation/Concepts.html (stream groupings)

Onur

On Wed, Oct 8, 2014 at 10:11 AM, Albert Vila <[email protected]> wrote:
> Now we are using Gearman; I don't know if anyone has succeeded in using
> Gearman to populate the spouts.
>
> And one of my questions is whether it's possible to isolate the execution
> of some bolts on a specific machine, so that all keyword extraction for
> document X is done by the same machine that crawled it. Or maybe I
> shouldn't be concerned about network traffic in a Storm cluster.
>
> Regards
>
> Albert
>
> On 7 October 2014 13:26, padma priya chitturi <[email protected]> wrote:
>
>> Storm would be feasible for your business problem. You could design the
>> topology so that a few bolts do the keyword extraction, another set of
>> bolts does the language detection, and so on. You can apply your Mahout
>> clustering and classification algorithms to the streams of data
>> processed by the bolts.
>>
>> The only thing that concerns me is the data source: if your data were
>> coming from a source like Kafka, that would be great. I don't think
>> spouts reading data from files would be the best fit.
>>
>> Regards,
>> Padma Ch
>>
>> On Tue, Oct 7, 2014 at 3:56 PM, Albert Vila <[email protected]> wrote:
>>
>>> Hi
>>>
>>> I just came across Storm while trying to find solutions to scale our
>>> current architecture.
>>>
>>> We are currently downloading and processing 6M documents per day from
>>> online and social media. We have a different workflow for each type of
>>> document, but some of the steps are keyword extraction, language
>>> detection, clustering, classification, indexing, .... We are using
>>> Gearman to dispatch the jobs to workers.
>>>
>>> I'm wondering whether we could integrate Storm into the current
>>> workflow and whether that's feasible. One of our main discussions is
>>> whether to go fully distributed or semi-distributed; that is, whether
>>> to distribute everything or to process some steps on the same machine
>>> (crawling, keyword extraction, language detection, indexing). We don't
>>> know which one scales better; each has pros and cons.
>>>
>>> Right now we have a semi-distributed setup, because we had network
>>> problems given the amount of data we were moving around. So now, all
>>> documents crawled on server X are later dispatched through Gearman to
>>> that same server, with all the data kept in a local Memcached.
>>>
>>> What do you think?
>>> Is it feasible to migrate to a Storm cluster?
>>> Should we take into account the traffic within the Storm cluster?
>>> Is there a way to isolate some bolts so they are processed on the same
>>> machine, grouped by some field?
>>>
>>> Any help or comment will be appreciated. And if someone has had a
>>> similar problem and has knowledge about the architecture approach, your
>>> input would be more than welcome.
>>>
>>> Thanks
>>>
>>> Albert
>>
> --
> Albert Vila
> R&D Manager & Software Developer
>
> Tel.: +34 972 982 968
>
> www.augure.com <http://www.augure.com/> | Blog. Reputation in action
> <http://blog.augure.es/> | Twitter. @AugureSpain
> <https://twitter.com/AugureSpain>
> Skype: albert.vila | Access map. Augure Girona
> <https://maps.google.com/maps?q=Eiximenis+12,+17001+Girona,+Espanya&hl=ca&sll=50.956548,6.799948&sspn=30.199963,86.044922&hnear=Carrer+Eiximenis,+12,+17001+Girona,+Espanya&t=m&z=16>

--
Onur Ünlü
