I'm having a hard time figuring out how to get a simple crawl working for 4 websites we'd like to add to an existing Solr index.
It seems like the requirements are pretty basic:

- 4 websites
- Recrawl on a regular schedule (weekly? daily?)
- Update an existing Solr index that a Drupal installation is also updating
- Remove previously indexed pages that now return a 404

The Drupal part is all working: the Drupal and Nutch-crawled pages both come up correctly when doing a search on the website. So what I need help with is a crawl script that will update the index and also remove deleted pages.

I've been searching for quite some time, but none of the scripts I've found seem to have been updated to work correctly with Nutch 1.3, and none of them remove pages that 404 from the index.

Can anyone offer any suggestions?

Thanks!
-Karl

