Re: Nutch not passing latest CrawlDatum to IndexingFilter plugin

Sebastian Nagel Tue, 18 Jun 2013 14:43:06 -0700

Hi Liaokz,

> After debugging, I could confirm that in CrawlDbReducer.java, Nutch really
> returns the latest CrawlDatum (at the line of "output.collect(key, result);"
> the member "result" has the latest data). I suppose the latest CrawlDatum
> is written to CrawlDB. Isn't it right?
No, or only partially:
- Multiple CrawlDatums are merged to determine the new status, fetch time,
  etc. It is not the case that the last datum is simply written into CrawlDb.
  Of course, the newest CrawlDatum carries the most weight
  because it contains the most recent status of the accessed document.
- This applies only if a dbStatus and a fetchStatus are merged
  (e.g., old status "fetched" + new status "linked" only changes the
   score of the old "fetched" CrawlDatum).
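
A rough sketch of that merge, in plain Java rather than actual Nutch code.
The Datum class, the Status enum, the fixed INTERVAL_MS and the simple score
addition are all made up for illustration; the real reducer delegates to the
fetch schedule and scoring filters.

```java
import java.util.Arrays;
import java.util.List;

class CrawlDbMergeSketch {
    // Hypothetical, simplified statuses; the real CrawlDatum has more of them.
    enum Status { DB_UNFETCHED, DB_FETCHED, FETCH_SUCCESS, LINKED }

    // Hypothetical stand-in for a CrawlDatum: status, fetch time (ms), score.
    static final class Datum {
        final Status status;
        final long fetchTime;
        final float score;

        Datum(Status status, long fetchTime, float score) {
            this.status = status;
            this.fetchTime = fetchTime;
            this.score = score;
        }
    }

    // Assumed fixed re-fetch interval; a real fetch schedule is configurable.
    static final long INTERVAL_MS = 30L * 24 * 60 * 60 * 1000;

    // Merge the datum already in the db with the data collected in this round.
    static Datum merge(Datum old, List<Datum> incoming) {
        Datum fetch = null;
        float linkScore = 0f;
        for (Datum d : incoming) {
            if (d.status == Status.FETCH_SUCCESS) {
                if (fetch == null || d.fetchTime > fetch.fetchTime) {
                    fetch = d; // keep the most recent fetch result
                }
            } else if (d.status == Status.LINKED) {
                linkScore += d.score; // inlinks only contribute score
            }
        }
        if (fetch == null) {
            // No fetch status to merge: status and fetch time stay as they were.
            return new Datum(old.status, old.fetchTime, old.score + linkScore);
        }
        // A fetch status was merged: new db status, fetch time moves to the next due date.
        return new Datum(Status.DB_FETCHED, fetch.fetchTime + INTERVAL_MS,
                         old.score + linkScore);
    }

    public static void main(String[] args) {
        Datum old = new Datum(Status.DB_FETCHED, 1000L, 1.0f);
        Datum merged = merge(old, Arrays.asList(new Datum(Status.LINKED, 2000L, 0.5f)));
        System.out.println(merged.status + " " + merged.fetchTime + " " + merged.score);
    }
}
```

In the "fetched" + "linked" case only the score differs from the old datum;
status and fetch time change only when a fetch status takes part in the merge.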


What you see in CrawlDb is a mixture of
- current status (unfetched, fetched, gone, notmodified, etc.)
- scheduling hints on how the document should be re-crawled in the future:
  * fetch time in CrawlDb is the time when the doc should be re-fetched
  * status may be set to another one, e.g., to "unfetched" to force a re-fetch

This answers
> 2. in line 273, why pass fetchDatum rather than dbDatum
The latest fetchDatum is more relevant because it reflects the
state at the time the document was fetched. The dbDatum
may have changed its state since then.

> 1. how is it ensured that the four variables get the latest data? Or does
> it not matter?
It does not matter if you only index one segment (that's the way used by
bin/crawl).
It matters if you index multiple segments at once. In that case the ordering
is respected because segments are passed to the indexer oldest first, so the
newest fetchDatum is used.
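
To illustrate why the order matters, here is a small sketch in plain Java
(not Nutch code). It assumes the reducer simply keeps the last fetch datum it
sees; FetchDatum, lastSeen and newestByTime are hypothetical names introduced
only for this example.

```java
import java.util.Arrays;
import java.util.List;

class NewestFetchDatumSketch {
    // Hypothetical stand-in for a fetch datum read from one segment.
    static final class FetchDatum {
        final String segment;
        final long fetchTime;

        FetchDatum(String segment, long fetchTime) {
            this.segment = segment;
            this.fetchTime = fetchTime;
        }
    }

    // Keeps the last datum seen, so the input order decides the result.
    static FetchDatum lastSeen(List<FetchDatum> inOrder) {
        FetchDatum chosen = null;
        for (FetchDatum d : inOrder) {
            chosen = d; // a later value overwrites an earlier one
        }
        return chosen;
    }

    // Order-independent alternative: compare fetch times explicitly.
    static FetchDatum newestByTime(List<FetchDatum> anyOrder) {
        FetchDatum newest = null;
        for (FetchDatum d : anyOrder) {
            if (newest == null || d.fetchTime > newest.fetchTime) {
                newest = d;
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        List<FetchDatum> oldestFirst = Arrays.asList(
            new FetchDatum("20130610", 100L),
            new FetchDatum("20130617", 200L));
        System.out.println(lastSeen(oldestFirst).segment);
        System.out.println(newestByTime(oldestFirst).segment);
    }
}
```

With segments ordered oldest first, both methods pick the datum from the
newest segment; with any other order only the explicit comparison does.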

Bye,
Sebastian


On 06/17/2013 11:18 AM, liaokz wrote:
> Hello all
> 
> I have written a plugin which implements the IndexingFilter interface. The
> plugin extracts some custom metadata fields from the CrawlDatum and adds
> them to NutchDocument so that the fields are indexed by Solr.
> 
> To make debugging easy, I configured Nutch to crawl my local website, so
> the crawled result is within my control. When I change some content in the
> pages, the crawled metadata should also change. But when indexed, Solr
> did not get the changed data.
> 
> After debugging, I could confirm that in CrawlDbReducer.java, Nutch really
> returns the latest CrawlDatum (at the line of "output.collect(key, result);"
> the member "result" has the latest data). I suppose the latest CrawlDatum
> is written to CrawlDB. Isn't it right?
> 
> Then I traced the process in IndexerMapReduce.java. In the method "reduce",
> the while loop assigns dbDatum, fetchDatum, parseData and parseText.
> Now come my questions:
> 1. how is it ensured that the four variables get the latest data? Or does
> it not matter?
> 2. in line 273, why pass fetchDatum rather than dbDatum
> to this.filters.filter? It seems that the dbDatum is the one stored in
> CrawlDB and is the latest one.
> 
> Thanks in advance.
> Liaokz
>
