Check the wiki, it's there:
http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/CommandLineOptions
http://wiki.apache.org/nutch/FAQ
The configuration explains a lot as well:
http://svn.apache.org/viewvc/nutch/trunk/conf/nutch-default.xml?view=markup

> I'm having a hard time figuring out how to get a simple crawl working for 4
> websites we'd like to add to an existing Solr index.
>
> It seems like the requirements are pretty basic:
>
> - 4 websites
> - Recrawl every however often (weekly? daily?)
> - Update existing Solr index that a Drupal installation is also updating
> - Remove pages that 404 that existed previously
>
> The Drupal part is all working, the Drupal and Nutch-crawled pages both
> come up and work correctly when doing a search on the website.
>
> So what I need help with is figuring out a crawl script that will update
> the index and also remove deleted pages.
>
> I've been searching for quite some time, but none of the scripts that I've
> found seem to be updated to work with Nutch 1.3 correctly, and none of
> them remove the pages that 404 from the index.
>
> Can anyone offer any suggestions?
>
> Thanks!
>
> -Karl

