Hi Albert,

You can use "local or shuffle grouping".

https://storm.incubator.apache.org/documentation/Concepts.html (stream
groupings)
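
A minimal sketch of the wiring in Java (the component names, parallelism
numbers, and the extractor/classifier bolt classes below are made-up
placeholders, not your actual setup; the spout follows the storm-kafka
module's API, since Kafka was mentioned earlier in the thread):

    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    TopologyBuilder builder = new TopologyBuilder();

    // Hypothetical ZooKeeper address and topic name; adjust to your setup.
    SpoutConfig spoutConf = new SpoutConfig(
            new ZkHosts("zookeeper:2181"), "documents",
            "/kafka-spout", "doc-reader");
    spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());
    builder.setSpout("docs", new KafkaSpout(spoutConf), 4);

    // localOrShuffleGrouping: if the target bolt has tasks in the same
    // worker process, tuples go to those in-process tasks and never
    // cross the network; otherwise it falls back to a plain shuffle.
    builder.setBolt("extract", new KeywordExtractorBolt(), 8) // placeholder bolt
           .localOrShuffleGrouping("docs");

    // fieldsGrouping: tuples with the same "docId" always go to the
    // same bolt task, i.e. grouping "by some field".
    builder.setBolt("classify", new ClassifierBolt(), 8)      // placeholder bolt
           .fieldsGrouping("extract", new Fields("docId"));

One caveat: a fields grouping pins a key to the same *task*, not
necessarily the same physical machine, so for cutting network traffic it's
the local-or-shuffle grouping that actually keeps tuples in-process.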

Onur


On Wed, Oct 8, 2014 at 10:11 AM, Albert Vila <[email protected]> wrote:

> We are currently using Gearman; I don't know if anyone has succeeded in
> using Gearman to populate the spouts.
>
> And one of my questions is whether it's possible to isolate the execution
> of some bolts on a specific machine, so that all keyword extraction for
> document X is done by the same machine that crawled it. Or maybe I should
> not be concerned about network traffic on a Storm cluster.
>
> Regards
>
> Albert
>
> On 7 October 2014 13:26, padma priya chitturi <[email protected]>
> wrote:
>
>> Storm would be feasible for your business problem. You could design the
>> topology in such a way that a few bolts do the job of keyword extraction,
>> another set of bolts does language detection, and so on. You can then
>> apply your Mahout clustering and classification algorithms to the streams
>> of data processed by the bolts.
>>
>> The only thing I am concerned about is the data source: if your data
>> comes from somewhere like Kafka, that would be great. I don't think
>> spouts reading data from files would be the best fit.
>>
>> Regards,
>> Padma Ch
>>
>> On Tue, Oct 7, 2014 at 3:56 PM, Albert Vila <[email protected]>
>> wrote:
>>
>>> Hi
>>>
>>> I just came across Storm when I was trying to find solutions to scale
>>> our current architecture.
>>>
>>> We are currently downloading and processing 6M documents per day from
>>> online and social media. We have a different workflow for each type of
>>> document, but some of the steps are keyword extraction, language
>>> detection, clustering, classification, indexing, ... We are using
>>> Gearman to dispatch the jobs to workers.
>>>
>>> I'm wondering whether we could integrate Storm into the current workflow
>>> and whether it's feasible. One of our main discussions is whether to go
>>> with a fully distributed architecture or a semi-distributed one; that
>>> is, distribute everything, or process some steps (crawling, keyword
>>> extraction, language detection, indexing) on the same machine. We don't
>>> know which one scales better; each has its pros and cons.
>>>
>>> Right now we have a semi-distributed one, because we ran into network
>>> problems given the amount of data we were moving around. So now, all
>>> documents crawled on server X are later dispatched through Gearman to
>>> that same server, keeping all the data in a local Memcached.
>>>
>>> What do you think?
>>> Is it feasible to migrate to a Storm cluster?
>>> Should we take into account the traffic among the Storm cluster?
>>> Is there a way to isolate some bolts so they run on the same machine,
>>> grouped by some field?
>>>
>>> Any help or comments would be appreciated. And if someone has faced a
>>> similar problem and has insight into the architectural approach, that
>>> would be more than welcome.
>>>
>>> Thanks
>>>
>>> Albert
>>>
>>
>>
>
>
> --
> *Albert Vila*
> R&D Manager & Software Developer
>
>
> Tel.: +34 972 982 968
>
> *www.augure.com* <http://www.augure.com/> | *Blog.* Reputation in action
> <http://blog.augure.es/> | *Twitter. *@AugureSpain
> <https://twitter.com/AugureSpain>
> *Skype *: albert.vila | *Access map.* Augure Girona
> <https://maps.google.com/maps?q=Eiximenis+12,+17001+Girona,+Espanya&hl=ca&sll=50.956548,6.799948&sspn=30.199963,86.044922&hnear=Carrer+Eiximenis,+12,+17001+Girona,+Espanya&t=m&z=16>
>



-- 
Onur Ünlü
