We are currently using Gearman; I don't know whether anyone has succeeded
in using Gearman to populate the spouts.
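As far as I can tell, the usual pattern would be a custom spout whose
nextTuple() method polls the external queue for jobs. A rough sketch of that
polling loop, using a Python stdlib queue as a stand-in for the Gearman
client (all names here are hypothetical, not a verified Gearman API):

```python
import queue

# Stand-in for the Gearman job queue; in a real spout this would be a
# Gearman client connection (hypothetical -- not a verified client API).
jobs = queue.Queue()
jobs.put({"doc_id": "X", "url": "http://example.com"})

def next_tuple(emitted):
    """Storm calls nextTuple() repeatedly; emit at most one job per call.

    `emitted` stands in for the spout's output collector.
    """
    try:
        job = jobs.get_nowait()
    except queue.Empty:
        return  # nothing to emit this round; Storm will call again
    emitted.append(job)
```

The key point is that nextTuple() must not block: when the queue is empty it
simply returns, and Storm keeps calling it.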

And one of my questions is whether it's possible to isolate the execution of
certain bolts on a specific machine, so that all keyword extraction for
document X is done by the same machine that crawled it. Or maybe I shouldn't
be concerned about network traffic on a Storm cluster.
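For reference, the behaviour I'm after looks like Storm's fieldsGrouping,
which routes all tuples carrying the same field value to the same bolt task.
A minimal sketch of that routing idea (the task count and helper name are
made up for illustration):

```python
import zlib

# Hypothetical number of parallel task instances for a bolt.
NUM_TASKS = 4

def route(doc_id: str, num_tasks: int = NUM_TASKS) -> int:
    """Deterministically map a grouping field to a task index,
    in the spirit of how a fields grouping hashes the tuple field.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_tasks
```

So every tuple for the same document lands on the same task, which keeps any
per-document state together; note it doesn't by itself pin that task to the
machine that crawled the document, since task placement is up to the
scheduler.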

Regards

Albert

On 7 October 2014 13:26, padma priya chitturi <[email protected]>
wrote:

> Storm would be feasible for your business problem. You could design the
> topology in such a way that some bolts do the job of keyword extraction,
> another set of bolts does language detection, and so on. You can then apply
> your Mahout clustering and classification algorithms to the streams of data
> processed by the bolts.
>
> The only thing I am concerned about is the data source: if your data were
> coming from a source like Kafka, that would be great. I don't think spouts
> reading data from files would be the best fit.
>
> Regards,
> Padma Ch
>
> On Tue, Oct 7, 2014 at 3:56 PM, Albert Vila <[email protected]>
> wrote:
>
>> Hi
>>
>> I just came across Storm when I was trying to find solutions to scale our
>> current architecture.
>>
>> We are currently downloading and processing 6M documents per day from
>> online and social media sources. We have a different workflow for each type
>> of document, but some of the steps are keyword extraction, language
>> detection, clustering, classification, indexing, etc. We use Gearman to
>> dispatch the jobs to workers.
>>
>> I'm wondering whether we could integrate Storm into the current workflow,
>> and if it's feasible. One of our main discussions is whether we should go
>> to a fully distributed architecture or a semi-distributed one; that is,
>> distribute everything, or process some steps (crawling, keyword extraction,
>> language detection, indexing) on the same machine. We don't know which one
>> scales better; each has its pros and cons.
>>
>> Right now we have a semi-distributed architecture, because we had network
>> problems given the amount of data we were moving around. So all documents
>> crawled on server X are later dispatched through Gearman to that same
>> server, keeping all the data in a local Memcached instance.
>>
>> What do you think?
>> Is it feasible to migrate to a Storm cluster?
>> Should we take into account the traffic within the Storm cluster?
>> Is there a way to pin certain bolts to the same machine, grouped by some
>> field?
>>
>> Any help or comments would be appreciated. And if someone has faced a
>> similar problem and has insight into the architectural approach, that
>> would be more than welcome.
>>
>> Thanks
>>
>> Albert
>>
>
>


-- 
*Albert Vila*
R&D Manager & Software Developer


Tél. : +34 972 982 968

*www.augure.com* <http://www.augure.com/> | *Blog.* Reputation in action
<http://blog.augure.es/> | *Twitter. *@AugureSpain
<https://twitter.com/AugureSpain>
*Skype *: albert.vila | *Access map.* Augure Girona
<https://maps.google.com/maps?q=Eiximenis+12,+17001+Girona,+Espanya&hl=ca&sll=50.956548,6.799948&sspn=30.199963,86.044922&hnear=Carrer+Eiximenis,+12,+17001+Girona,+Espanya&t=m&z=16>
