This is the default behaviour. If pages are scheduled for fetching, they will show up in the next segment. If you index that segment, the old document in Solr is overwritten.
> But we also need to detect modified documents in order to trigger an
> update command to Solr (an improvement of SolrIndexer). I was planning
> to open a Jira issue on this missing functionality this week.
>
> Erlend
>
> On 26.01.11 18.12, Claudio Martella wrote:
> > Today I had a look at the code and wrote this class. It works here on my
> > test cluster.
> >
> > It scans the crawldb for entries carrying the STATUS_DB_GONE and it
> > issues a delete to solr for those entries.
> >
> > Is that what you guys have in mind? Should i file a JIRA?
> >
> > On 1/24/11 10:26 AM, Markus Jelsma wrote:
> >> Each item in the CrawlDB carries a status field. Reading the CrawlDB
> >> will return this information as well, the same goes for a complete dump
> >> with which you could create the appropriate delete statements for your
> >> Solr instance.
> >>
> >> /** Page no longer exists. */
> >> public static final byte STATUS_DB_GONE = 0x03;
> >>
> >> http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
> >>
> >>> Where is that information stored? it could be then easily used to issue
> >>> deletes on solr.
> >>>
> >>> On 1/23/11 10:32 PM, Markus Jelsma wrote:
> >>>> Nutch can detect 404's by recrawling existing URL's. The mutation,
> >>>> however, is not pushed to Solr at the moment.
> >>>>
> >>>>> As far as I know, Nutch can only discover new URLs to crawl and send
> >>>>> the parsed content to Solr. But what about maintaining the index?
> >>>>> Say that you have a daily Nutch script that fetches/parses the web
> >>>>> and updates the Solr index. After one month, several web pages have
> >>>>> been modified and some have also been deleted. In other words, the
> >>>>> Solr index is out of sync.
> >>>>>
> >>>>> Is it possible to detect such changes in order to send update/delete
> >>>>> commands to Solr?
> >>>>>
> >>>>> It looks like the Aperture crawler has a workaround for this since
> >>>>> the crawler handler have methods such as objectChanged(...):
> >>>>> http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
> >>>>>
> >>>>> Erlend
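For anyone who wants to try the dump-based route Markus describes above, here is a rough sketch of the idea. It is not Claudio's class and it uses no Nutch or SolrJ APIs. It only reads a textual CrawlDB dump and builds a Solr XML delete message for the entries marked STATUS_DB_GONE. Some things here are assumptions on my part: the class and method names (GoneUrlDeletes, goneUrls, deleteXml) are made up, the dump is assumed to contain records of the form "<url><TAB>Version: ..." followed by a "Status: <n> (<name>)" line (please check this against what readdb -dump produces on your version), and the Solr uniqueKey is assumed to be the page URL.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: turn a textual CrawlDB dump into a Solr XML delete message. */
final class GoneUrlDeletes {

    // Value of CrawlDatum.STATUS_DB_GONE as quoted in the thread (0x03).
    static final int STATUS_DB_GONE = 0x03;

    /**
     * Collects the URLs whose record carries STATUS_DB_GONE.
     * ASSUMPTION: each record starts with "<url>\tVersion: ..." and is
     * followed by a line "Status: <n> (<name>)". Verify against your dump.
     */
    static List<String> goneUrls(String dump) {
        List<String> gone = new ArrayList<>();
        String currentUrl = null;
        for (String line : dump.split("\\R")) {
            int tab = line.indexOf('\t');
            if (tab > 0 && line.startsWith("Version:", tab + 1)) {
                // First line of a new record: remember its URL.
                currentUrl = line.substring(0, tab);
            } else if (currentUrl != null && line.startsWith("Status:")) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length > 1 && parts[1].matches("\\d+")
                        && Integer.parseInt(parts[1]) == STATUS_DB_GONE) {
                    gone.add(currentUrl);
                }
                currentUrl = null; // status seen, wait for the next record
            }
        }
        return gone;
    }

    /** Escapes XML special characters so a URL is safe inside <id>. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;").replace("'", "&apos;");
    }

    /** Builds one Solr XML delete message; ASSUMES the uniqueKey is the URL. */
    static String deleteXml(List<String> urls) {
        StringBuilder sb = new StringBuilder("<delete>");
        for (String url : urls) {
            sb.append("<id>").append(escapeXml(url)).append("</id>");
        }
        return sb.append("</delete>").toString();
    }

    public static void main(String[] args) {
        // Hypothetical two-record dump, for illustration only.
        String dump = "http://example.com/a?x=1&y=2\tVersion: 7\nStatus: 3 (db_gone)\n"
                    + "http://example.com/b\tVersion: 7\nStatus: 2 (db_fetched)\n";
        System.out.println(deleteXml(goneUrls(dump)));
    }
}
```

The resulting message would still have to be posted to the Solr update handler and followed by a commit. Reading the CrawlDB directly, as Claudio's class apparently does, avoids the intermediate dump, so treat this only as an illustration of the data flow.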

