Each item in the CrawlDB carries a status field. Reading the CrawlDB returns
this information, as does a complete dump, from which you could generate the
appropriate delete statements for your Solr instance.

    /** Page no longer exists. */
    public static final byte STATUS_DB_GONE = 0x03;

http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
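
Below is a minimal sketch of how such deletes could be generated in Java by
scanning the CrawlDB directly. It is a sketch, not the Nutch-provided way: it
assumes a Nutch 1.3-era stack (Hadoop's SequenceFile API, SolrJ's
CommonsHttpSolrServer), that the first argument points at one CrawlDB part
file (e.g. crawl/crawldb/current/part-00000/data), and that your Solr schema
uses the page URL as its uniqueKey; adjust all of that to your setup.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class SolrGoneDeleter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // One CrawlDB part file, e.g. crawl/crawldb/current/part-00000/data
        Path data = new Path(args[0]);
        // Solr endpoint, e.g. http://localhost:8983/solr
        SolrServer solr = new CommonsHttpSolrServer(args[1]);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        CrawlDatum datum = new CrawlDatum();
        while (reader.next(url, datum)) {
          // STATUS_DB_GONE (0x03) marks pages that no longer exist
          if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
            solr.deleteById(url.toString()); // assumes uniqueKey == URL
          }
        }
        reader.close();
        solr.commit();
      }
    }

In practice you would run something like this after each updatedb cycle; the
core idea is just filtering entries on the status byte shown above.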

> Where is that information stored? it could be then easily used to issue
> deletes on solr.
> 
> On 1/23/11 10:32 PM, Markus Jelsma wrote:
> > Nutch can detect 404s by recrawling existing URLs. That change, however,
> > is not pushed to Solr at the moment.
> > 
> >> As far as I know, Nutch can only discover new URLs to crawl and send the
> >> parsed content to Solr. But what about maintaining the index? Say that
> >> you have a daily Nutch script that fetches/parses the web and updates
> >> the Solr index. After one month, several web pages have been modified
> >> and some have also been deleted. In other words, the Solr index is out
> >> of sync.
> >> 
> >> Is it possible to detect such changes in order to send update/delete
> >> commands to Solr?
> >> 
> >> It looks like the Aperture crawler has a workaround for this, since its
> >> crawler handlers have methods such as objectChanged(...):
> >> http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
> >> 
> >> Erlend
