Each item in the CrawlDB carries a status field. Reading the CrawlDB returns this information, as does a complete dump, from which you could create the appropriate delete statements for your Solr instance.
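As a minimal sketch of that dump-based approach: the following assumes a plain-text dump produced by "bin/nutch readdb crawldb -dump dumpdir", the 3.x-era SolrJ API, and a Solr schema whose uniqueKey is the document URL. The class name GoneDocDeleter is made up for illustration; nothing like it ships with Nutch, and the exact dump format may vary between versions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

/**
 * Hypothetical helper: scans a text dump from "bin/nutch readdb
 * crawldb -dump dumpdir" and deletes every URL marked db_gone
 * from a Solr index. Not part of Nutch.
 */
public class GoneDocDeleter {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    List<String> toDelete = new ArrayList<String>();

    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line, url = null;
    while ((line = in.readLine()) != null) {
      // Each record starts with "<url>\tVersion: 7"; the status
      // follows on its own line, e.g. "Status: 3 (db_gone)".
      if (line.indexOf("\tVersion:") != -1) {
        url = line.substring(0, line.indexOf('\t'));
      } else if (url != null && line.startsWith("Status:")
          && line.indexOf("db_gone") != -1) {
        toDelete.add(url);
      }
    }
    in.close();

    if (!toDelete.isEmpty()) {
      solr.deleteById(toDelete); // assumes uniqueKey == URL
      solr.commit();
    }
  }
}

Run it against each part file of the dump; the same loop could just as easily emit <delete><id>...</id></delete> XML for posting by hand.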
The status value for vanished pages is defined in CrawlDatum:

/** Page no longer exists. */
public static final byte STATUS_DB_GONE = 0x03;

http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup

> Where is that information stored? It could then easily be used to issue
> deletes on Solr.
>
> On 1/23/11 10:32 PM, Markus Jelsma wrote:
> > Nutch can detect 404s by recrawling existing URLs. The mutation,
> > however, is not pushed to Solr at the moment.
> >
> >> As far as I know, Nutch can only discover new URLs to crawl and send
> >> the parsed content to Solr. But what about maintaining the index? Say
> >> that you have a daily Nutch script that fetches/parses the web and
> >> updates the Solr index. After one month, several web pages have been
> >> modified and some have also been deleted. In other words, the Solr
> >> index is out of sync.
> >>
> >> Is it possible to detect such changes in order to send update/delete
> >> commands to Solr?
> >>
> >> It looks like the Aperture crawler has a workaround for this, since
> >> its crawler handlers have methods such as objectChanged(...):
> >> http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
> >>
> >> Erlend

