Nutch can detect 404s by recrawling existing URLs. That change, however, is
not yet pushed to Solr.
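Until Nutch forwards those deletions itself, one workaround is to push them to Solr directly via its XML update handler. A minimal sketch, assuming the Solr document ids are the page URLs (as in the standard Nutch/Solr setup) and the default update endpoint; the list of gone URLs would come from your own recrawl bookkeeping:

```python
# Sketch, not Nutch code: push deletes for URLs that now return 404
# straight to Solr's XML update handler, then commit.
from xml.sax.saxutils import escape
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # assumed endpoint

def build_delete_payload(gone_urls):
    """Build a Solr <delete> command listing each dead URL as a doc id."""
    ids = "".join("<id>%s</id>" % escape(u) for u in gone_urls)
    return "<delete>%s</delete>" % ids

def push_deletes(gone_urls, solr_url=SOLR_UPDATE_URL):
    """POST the delete command, then a commit, to Solr."""
    for body in (build_delete_payload(gone_urls), "<commit/>"):
        req = urllib.request.Request(
            solr_url,
            data=body.encode("utf-8"),
            headers={"Content-Type": "text/xml"},
        )
        urllib.request.urlopen(req)
```

Detecting modified (rather than deleted) pages is the easier half: a recrawl that refetches and reparses a changed page will simply reindex it under the same id, overwriting the stale Solr document.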

> As far as I know, Nutch can only discover new URLs to crawl and send the
> parsed content to Solr. But what about maintaining the index? Say that
> you have a daily Nutch script that fetches/parses the web and updates
> the Solr index. After one month, several web pages have been modified
> and some have also been deleted. In other words, the Solr index is out
> of sync.
> 
> Is it possible to detect such changes in order to send update/delete
> commands to Solr?
> 
> It looks like the Aperture crawler has a workaround for this, since its
> crawler handlers have methods such as objectChanged(...):
> http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
> 
> Erlend
