advice, config files for crawling private wikipedia mirror

Fred Zimmerman Sat, 08 Oct 2011 10:30:40 -0700

Hi,

I am looking for advice on how to configure Nutch (and Solr) to crawl a
private Wikipedia mirror.


   - It is my own mirror on an intranet, so I do not need to be polite to
   myself.
   - I need to complete this 11-million-page crawl as fast as I reasonably
   can.
   - Both crawler and mirror are 1.7GB machines dedicated to this task.
   - I only need to crawl internal links (not external ones).
   - Eventually I will need to update the crawl, but a monthly update will
   be sufficient.

Any advice (and sample config files) would be much appreciated!

Fred
