Hi Howard (and Sebb),

You could do it with Nutch, but due to the batch nature of MapReduce it is
not a natural fit: there is no guarantee that the previous batch operation
will have finished in time for the next one. There are ways around this,
but the whole thing would get rather convoluted and difficult to maintain
and run in production.

Instead you could use a real-time system like Storm, which would simplify
the logic around scheduling the fetches. See
https://github.com/DigitalPebble/storm-crawler for components you could
reuse to that effect. It sounds like what you need is fairly
straightforward (no recursive discovery of new URLs, etc.) and it should
not be too difficult to do with Storm.
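As a side note, the core of what Howard describes below (download each
page's HTML once, then run a whole battery of checks against it, without
restarting anything when a check is added) is framework-independent. A
minimal sketch of that pattern, with hypothetical check names and no Storm
or Nutch involved:

```python
# Registry of checks. Because checks are looked up at call time, new ones
# can be registered while the fetch loop keeps running -- no restart needed.
CHECKS = {}

def check(name):
    """Decorator that registers a check function under a given name."""
    def register(fn):
        CHECKS[name] = fn
        return fn
    return register

# Two illustrative checks (hypothetical examples, not from the thread):
@check("has_title")
def has_title(html):
    return "<title>" in html.lower()

@check("not_empty")
def not_empty(html):
    return len(html.strip()) > 0

def run_checks(html):
    # The HTML is fetched once; every registered check runs against
    # the same in-memory copy.
    return {name: fn(html) for name, fn in CHECKS.items()}
```

For example, `run_checks("<html><title>x</title></html>")` returns
`{"has_title": True, "not_empty": True}`. In a Storm topology the fetch
would live in a bolt and the registry could be reloaded from a shared
store, but the one-download / many-checks shape stays the same.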

Julien

PS: am on holiday this week, probably won't have access to the web for some
time



On 18 August 2014 09:51, howard chen <[email protected]> wrote:

> Hello
>
> On Sat, Aug 16, 2014 at 11:02 PM, Sebastian Nagel
> <[email protected]> wrote:
> > * "mapped to 100 parsers": does it mean 100 configurations
> >   (or syntactic patterns) or really 100 parser objects?
>
>
> For each website we crawl (monitor), we need to run a set of tests
> against it, so we download the HTML only once but run as many as 100
> tests against it. The current system is painful because whenever we
> add a new test or update existing test code, we need to stop and
> restart the whole cluster. We think we shouldn't waste time
> reinventing a distributed task system, so we are looking at whether
> any existing open-source solution would be a better choice.
>
> Thanks
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
