Hi Howard (and Sebb),

You could do it with Nutch, but due to the batch nature of MapReduce it is not a natural fit: there is no guarantee that the previous batch operation will finish in time for the next one to start. There could be ways around this, but the whole thing would get rather convoluted and difficult to maintain and run in prod.
Instead you could use a realtime system like Storm, which would simplify the logic around the scheduling of the fetches. See https://github.com/DigitalPebble/storm-crawler for components you could reuse to that effect. It sounds like what you need is fairly straightforward (no recursive discovery of new URLs, etc.) and it should not be too difficult to do with Storm.

Julien

PS: I am on holiday this week and probably won't have access to the web for some time.

On 18 August 2014 09:51, howard chen <[email protected]> wrote:
> Hello
>
> On Sat, Aug 16, 2014 at 11:02 PM, Sebastian Nagel
> <[email protected]> wrote:
> > * "mapped to 100 parsers": does it mean 100 configurations
> > (or syntactic patterns) or really 100 parser objects?
>
> For each website we crawl (monitor), we need to run a set of tests
> against it, so we only download the HTML once but run as many as 100
> tests against it. The current system is painful because whenever we
> add a new test or update existing test code, we need to stop and
> restart the whole cluster. We don't think we should waste time
> reinventing a distributed task system, so we are looking at whether an
> existing open-source solution would be a better choice.
>
> Thanks

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
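
For what it's worth, the "download the HTML once, fan it out to many pluggable tests" pattern Howard describes can be sketched independently of any particular framework. The snippet below is a minimal illustration of that idea in Python, not storm-crawler's actual API (storm-crawler is Java); all names here (`register`, `run_tests`, `TESTS`) are hypothetical. In a Storm topology the registry would correspond to a fetch bolt emitting the same document to every test bolt, and keeping tests in a registry loaded at runtime is what avoids the cluster restart Howard complains about.

```python
from typing import Callable, Dict

# Registry of test functions keyed by name. In a real deployment these
# could be discovered and (re)loaded dynamically, so adding or updating
# a test would not require restarting the whole cluster.
TESTS: Dict[str, Callable[[str], bool]] = {}

def register(name: str):
    """Decorator that adds a test function to the registry."""
    def wrap(fn: Callable[[str], bool]):
        TESTS[name] = fn
        return fn
    return wrap

@register("has_title")
def has_title(html: str) -> bool:
    # Toy example of a per-page check.
    return "<title>" in html

@register("no_server_error")
def no_server_error(html: str) -> bool:
    return "500 Internal Server Error" not in html

def run_tests(html: str) -> Dict[str, bool]:
    # Fan the single fetched document out to every registered test,
    # mirroring one fetch bolt feeding many test/parser bolts.
    return {name: fn(html) for name, fn in TESTS.items()}
```

The point of the sketch is only the separation of concerns: fetch once, then apply however many checks are currently registered, without the fetch side knowing or caring how many tests exist.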

