Hi,

in general, it should be possible to adapt Nutch to this task:

1 inject 100k URLs
  * a fixed fetch interval for each URL can be defined in the seed list:
    url \t nutchFetchIntervalMDName=<interval in seconds>
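  A seed list could look like this (a sketch, assuming Nutch 1.x,
  where the Injector reads the metadata key nutch.fetchInterval —
  that is the value behind nutchFetchIntervalMDName — and fields
  are tab-separated; the URLs and intervals are placeholders):

  ```
  # seeds.txt (tab-separated)
  http://example.com/status	nutch.fetchInterval=60
  http://example.com/health	nutch.fetchInterval=300
  http://example.com/report	nutch.fetchInterval=900
  ```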

2 generate fetch list(s)
  * select pages which need to be checked now
  * partition by host (and/or parser)

3 fetch and parse (fetcher.parse = true)
  * optionally, report errors immediately
  * do not store raw and parsed content,
  * only keep fetch and parse status, and fetch time
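  In nutch-site.xml this could look like the following (a sketch
  using Nutch 1.x property names; fetcher.store.content controls
  whether the raw content is kept):

  ```xml
  <!-- nutch-site.xml fragment: parse inside the fetcher,
       drop the raw content to save space -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
  </property>
  ```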

4 update: add (next) fetch time and status to WebTable / CrawlDb

Repeat steps 2-4; from time to time, inject new URLs.
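The whole cycle could be sketched as a shell loop (command names as
in the Nutch 1.x bin/nutch script; the crawl/ directory layout and
seeds/ path are placeholders):

```shell
#!/bin/bash
# Sketch of the inject / generate / fetch / update cycle;
# assumes a Nutch 1.x installation in the current directory.
bin/nutch inject crawl/crawldb seeds/              # step 1, once (re-run for new URLs)
while true; do
  bin/nutch generate crawl/crawldb crawl/segments  # step 2: select pages due now
  segment=crawl/segments/$(ls crawl/segments | sort | tail -1)  # newest segment
  bin/nutch fetch "$segment"                       # step 3: fetch (+ parse if fetcher.parse=true)
  bin/nutch updatedb crawl/crawldb "$segment"      # step 4: write next fetch time and status
done
```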

Difficulties may include:
* "mapped to 100 parsers": does it mean 100 configurations
  (or syntactic patterns) or really 100 parser objects?
  For the latter, it may be more efficient to hold the objects
  in a server than to create them anew in every fetch-and-parse job
* each of these steps involves one or more MapReduce jobs,
  which means a certain amount of overhead and delay
  (in seconds) for job management and for creating a JVM for each job.
  If the check intervals have to be followed precisely,
  the scheduling provided by Nutch may not be ideal.
  But if it's acceptable that, e.g., a 1-min page is checked after 80 s,
  there should be no problem.


Sebastian

On 08/16/2014 05:59 AM, howard chen wrote:
> Hello
> 
> We are in the process of evaluating different opensource solutions for
> our distributed monitoring solution.
> 
> Currently the system is developed in house, basic features are:
> 
> - there are over 100K urls to monitor at a specific interval, 1 min, 5
> min, 15 min
> - these 100K urls are mapped to 100 parsers, for checking different
> syntax appear in the HTML
> - send out alert if parser failed
> 
> While it is not exactly a crawler, it is very similar in nature.
> 
> We are looking at a solution that we can focus on our business logic
> (i.e. the parsers), rather than the moving parts of the system (e.g.
> how to distribute, how to queue etc).
> 
> Do you think nutch would be a good candidate?
> 
> Thanks.
> 
