Check the wiki; it covers this:

http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/CommandLineOptions
http://wiki.apache.org/nutch/FAQ

The configuration explains a lot as well:

http://svn.apache.org/viewvc/nutch/trunk/conf/nutch-default.xml?view=markup
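
For what it's worth, here is a rough sketch of what one recrawl round could look like with the Nutch 1.3 command-line tools described on those wiki pages. Treat it as a starting point, not a tested script: the `recrawl` function name, the `crawl` and `urls` directories, the Solr URL, and the `-topN` value are all assumptions of mine. It relies on `solrclean` to delete documents that the crawldb has marked as gone (for example after a 404). How soon a page becomes due for refetching is controlled by `db.fetch.interval.default` in the configuration linked above.

```shell
#!/bin/sh
# Sketch of one recrawl round for Nutch 1.3 feeding an existing Solr index.
# All paths, the Solr URL, and the topN value are assumptions; adjust them.

recrawl() {
    nutch="${NUTCH:-bin/nutch}"                 # Nutch launcher script
    crawl="${CRAWL:-crawl}"                     # holds crawldb, linkdb, segments
    seeds="${SEEDS:-urls}"                      # directory with seed URL files
    solr="${SOLR:-http://localhost:8983/solr/}" # existing Solr instance

    # Add the seed URLs, then select the pages that are due for fetching.
    "$nutch" inject "$crawl/crawldb" "$seeds" || return 1
    "$nutch" generate "$crawl/crawldb" "$crawl/segments" -topN 1000 || return 1

    # Segment names are timestamps, so the last one listed is the newest.
    segment=$(ls -d "$crawl"/segments/* | tail -1)

    "$nutch" fetch "$segment" || return 1
    "$nutch" parse "$segment" || return 1
    # Record fetch results in the crawldb, including pages that are now gone.
    "$nutch" updatedb "$crawl/crawldb" "$segment" || return 1
    "$nutch" invertlinks "$crawl/linkdb" -dir "$crawl/segments" || return 1

    # Index the new segment, then remove documents the crawldb marks as gone.
    "$nutch" solrindex "$solr" "$crawl/crawldb" "$crawl/linkdb" "$segment" || return 1
    "$nutch" solrclean "$crawl/crawldb" "$solr"
}
```

To run it on a schedule, source the file and call `recrawl` from a cron job in the Nutch install directory, overriding `NUTCH`, `CRAWL`, `SEEDS`, or `SOLR` in the environment as needed. Since Drupal writes to the same index, it is worth checking first that the cleanup step only touches documents that came from Nutch.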


> I'm having a hard time figuring out how to get a simple crawl working for 4
> websites we'd like to add to an existing Solr index.
> 
> It seems like the requirements are pretty basic:
> 
> - 4 websites
> - Recrawl on a regular schedule (weekly? daily?)
> - Update existing Solr index that a Drupal installation is also updating
> - Remove pages that 404 that existed previously
> 
> The Drupal part is all working: the Drupal and Nutch-crawled pages both
> come up and work correctly when doing a search on the website.
> 
> So what I need help with is figuring out a crawl script that will update
> the index and also remove deleted pages.
> 
> I've been searching for quite some time, but none of the scripts that I've
> found seem to be updated to work with Nutch 1.3 correctly, and none of
> them remove the pages that 404 from the index.
> 
> Can anyone offer any suggestions?
> 
> Thanks!
> 
> -Karl
