Hello, I am using nutch-1.3 to crawl an intranet site. For testing purposes, I created a local test website with an index.html that links to three other local HTML pages (page1.html, page2.html, page3.html).
To crawl, I ran:

    ./bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir ./crawl -depth 3 -topN 10

Afterwards I checked the status:

    ./bin/nutch readdb ./crawl/crawldb -stats
    CrawlDb statistics start: ./crawl/crawldb
    Statistics for CrawlDb: ./crawl/crawldb
    TOTAL urls: 4
    retry 0: 4
    min score: 0.666
    avg score: 0.7495
    max score: 1.0
    status 2 (db_fetched): 1
    status 6 (db_notmodified): 3
    CrawlDb statistics: done

At this point the site is indexed into Solr.

Then I removed page3.html, along with the hyperlink to it on the home page, and reran the crawl:

    ./bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir ./crawl -depth 3 -topN 10

    ./bin/nutch readdb ./crawl/crawldb -stats
    CrawlDb statistics start: ./crawl/crawldb
    Statistics for CrawlDb: ./crawl/crawldb
    TOTAL urls: 4
    retry 0: 4
    min score: 0.666
    avg score: 1.7495
    max score: 2.666
    status 1 (db_unfetched): 1
    status 6 (db_notmodified): 3
    CrawlDb statistics: done

Checking the removed page in the crawldb yields:

    ./bin/nutch readdb ./crawl/crawldb -url http://localhost/page3.html
    URL: http://localhost/page3.html
    Version: 7
    Status: 1 (db_unfetched)
    Fetch time: Wed Aug 03 15:29:35 EDT 2011
    Modified time: Wed Dec 31 19:00:00 EST 1969
    Retries since fetch: 0
    Retry interval: 5 seconds (0 days)
    Score: 0.6666667
    Signature: null
    Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page3.html

So it looks like page3.html was marked as not found (notfound(14)), although its status is db_unfetched.

Then I ran solrclean:

    ./bin/nutch solrclean crawl/crawldb http://localhost:8983/solr/
    SolrClean: starting at 2011-08-03 15:40:37
    SolrClean: deleted a total of 0 documents
    SolrClean: finished at 2011-08-03 15:40:39, elapsed: 00:00:01

I don’t understand why page3.html was not deleted from Solr. I also tried running the individual steps (inject, generate, fetch, parse, updatedb, invertlinks, solrindex), which gave me the same result.

Please help.

- Alex

