Hi,
in general, it should be possible to adapt Nutch to this task:
1 inject 100k URLs
* a fixed fetch interval for each URL can be defined in the seed list:
url \t nutch.fetchInterval=<interval in seconds>
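For example, such a seed list could be generated from a plain URL file with a one-liner like the following (file names and the fixed 60 s interval are just an illustration; in practice the interval would come from your monitoring configuration):

```shell
# Sample URL list, one URL per line (stand-in for the real 100k list).
printf 'http://example.com/a\nhttp://example.com/b\n' > urls.txt

# Append a nutch.fetchInterval metadata field (here: 60 seconds)
# to every URL, separated by a tab as the injector expects.
awk -v interval=60 \
    '{ printf "%s\tnutch.fetchInterval=%d\n", $1, interval }' \
    urls.txt > seed.txt
```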
2 generate fetch list(s)
* select pages which need to be checked now
* partition by host (and/or parser)
3 fetch and parse (fetcher.parse = true)
* optionally, report errors immediately
* do not store the raw or parsed content,
* only keep the fetch and parse status and the fetch time
4 update: add (next) fetch time and status to WebTable / CrawlDb
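For step 3, the relevant settings would go into nutch-site.xml; a sketch below, assuming the property names of Nutch 1.x (please check them against the nutch-default.xml of your version):

```
<!-- nutch-site.xml (sketch) -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>Parse within the fetch job, no separate parse step.</description>
</property>
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
  <description>Do not keep the raw page content.</description>
</property>
```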
Repeat 2-4, from time to time: inject new URLs
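The whole cycle, expressed with the standard Nutch 1.x command-line tools (the CrawlDb and segment paths are illustrative; a real setup would drive this from cron or a small wrapper script rather than an endless loop):

```
#!/bin/sh
# Sketch of the inject/generate/fetch/update cycle.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments

bin/nutch inject $CRAWLDB seeds/          # 1: inject URLs + intervals
while true; do
  bin/nutch generate $CRAWLDB $SEGMENTS   # 2: select pages due now
  SEGMENT=$(ls -d $SEGMENTS/* | tail -1)  # newest segment
  bin/nutch fetch $SEGMENT                # 3: fetch (+ parse, see config)
  bin/nutch updatedb $CRAWLDB $SEGMENT    # 4: write back status and time
done
```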
Difficulties may include:
* "mapped to 100 parsers": does this mean 100 configurations
(or syntactic patterns) or really 100 parser objects?
In the latter case it may be more efficient to hold the objects
in a server than to create them anew in every fetch-and-parse job
* each of these steps involves one or more MapReduce jobs,
which means a certain amount of overhead and delay
(in seconds) for job management, and for creating a JVM for each job.
If the check intervals have to be followed precisely,
the scheduling provided by Nutch may not be ideal.
But if it is acceptable that, e.g., a 1-min page is checked
after 80 s, there should be no problem.
Sebastian
On 08/16/2014 05:59 AM, howard chen wrote:
> Hello
>
> We are in the process of evaluating different opensource solutions for
> our distributed monitoring solution.
>
> Currently the system is developed in house, basic features are:
>
> - there are over 100K urls to monitor at a specific interval, 1 min, 5
> min, 15 min
> - these 100K urls are mapped to 100 parsers, for checking different
> syntax appear in the HTML
> - send out alert if parser failed
>
> While it is not exactly a crawler, it is very similar in nature.
>
> We are looking at a solution that we can focus on our business logic
> (i.e. the parsers), rather than the moving parts of the system (e.g.
> how to distribute, how to queue etc).
>
> Do you think nutch would be a good candidate?
>
> Thanks.
>