This is the default behaviour. Pages that are scheduled for fetching show up
in the next segment, and if you index that segment the old document in Solr
is overwritten.

> But we also need to detect modified documents in order to trigger an
> update command to Solr (an improvement of SolrIndexer). I was planning
> to open a Jira issue on this missing functionality this week.
> 
> Erlend
> 
> On 26.01.11 18.12, Claudio Martella wrote:
> > Today I had a look at the code and wrote this class. It works here on my
> > test cluster.
> > 
> > It scans the crawldb for entries carrying the STATUS_DB_GONE status and
> > issues a delete to Solr for those entries.
> > 
> > Is that what you guys have in mind? Should I file a JIRA?
> > 
> > On 1/24/11 10:26 AM, Markus Jelsma wrote:
> >> Each item in the CrawlDB carries a status field. Reading the CrawlDB
> >> returns this information, as does a complete dump, from which you
> >> could create the appropriate delete statements for your Solr
> >> instance.
> >> 
> >>     /** Page no longer exists. */
> >>     public static final byte STATUS_DB_GONE = 0x03;
> >> 
> >> http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
> >> 
> >>> Where is that information stored? It could then easily be used to
> >>> issue deletes on Solr.
> >>> 
> >>> On 1/23/11 10:32 PM, Markus Jelsma wrote:
> >>>> Nutch can detect 404s by recrawling existing URLs. The mutation,
> >>>> however, is not pushed to Solr at the moment.
> >>>> 
> >>>>> As far as I know, Nutch can only discover new URLs to crawl and send
> >>>>> the parsed content to Solr. But what about maintaining the index?
> >>>>> Say that you have a daily Nutch script that fetches/parses the web
> >>>>> and updates the Solr index. After one month, several web pages have
> >>>>> been modified and some have also been deleted. In other words, the
> >>>>> Solr index is out of sync.
> >>>>> 
> >>>>> Is it possible to detect such changes in order to send update/delete
> >>>>> commands to Solr?
> >>>>> 
> >>>>> It looks like the Aperture crawler has a workaround for this, since
> >>>>> the crawler handler has methods such as objectChanged(...):
> >>>>> http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
> >>>>> 
> >>>>> Erlend
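
The approach discussed in the thread above (find CrawlDB entries whose status is STATUS_DB_GONE and send deletes to Solr) could be sketched roughly as follows. This is only an illustration, not the class Claudio wrote: the CrawlDB is stood in for by a plain map of URL to status byte, the class and method names are hypothetical, and it assumes the page URL is the Solr uniqueKey. Only the STATUS_DB_GONE value and the Solr XML delete-by-id message format are taken as given.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: turn CrawlDB status entries into a Solr XML delete message. */
final class GoneUrlDeletes {

    /** Same value as CrawlDatum.STATUS_DB_GONE quoted in the thread. */
    static final byte STATUS_DB_GONE = 0x03;

    /** Returns the URLs whose status is STATUS_DB_GONE, in iteration order. */
    static List<String> goneUrls(Map<String, Byte> crawlDbStatuses) {
        List<String> gone = new ArrayList<>();
        for (Map.Entry<String, Byte> e : crawlDbStatuses.entrySet()) {
            if (e.getValue() == STATUS_DB_GONE) {
                gone.add(e.getKey());
            }
        }
        return gone;
    }

    /** Escapes the characters that are special in XML text content. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a delete-by-id message; assumes the URL is the Solr uniqueKey. */
    static String deleteXml(List<String> urls) {
        StringBuilder sb = new StringBuilder("<delete>");
        for (String url : urls) {
            sb.append("<id>").append(escapeXml(url)).append("</id>");
        }
        return sb.append("</delete>").toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for a CrawlDB dump.
        Map<String, Byte> statuses = new LinkedHashMap<>();
        statuses.put("http://example.com/a", (byte) 0x02); // any other status
        statuses.put("http://example.com/old?x=1&y=2", STATUS_DB_GONE);
        System.out.println(deleteXml(goneUrls(statuses)));
    }
}
```

The resulting message would be posted to the Solr update handler and followed by a commit. A caller would probably want to skip the request entirely when no gone URLs are found, since the sketch emits an empty delete element in that case.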
