Hello all

I have written a plugin which implements the IndexingFilter interface. The
plugin extracts some custom metadata fields from the CrawlDatum and adds
them to the NutchDocument so that the fields get indexed by Solr.
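For context, the core of my filter is essentially the copy step below. This is a simplified, self-contained sketch: MetaDatum and Doc are stand-ins for Nutch's CrawlDatum and NutchDocument, the "my." key prefix is hypothetical, and the real IndexingFilter.filter() method of course takes more arguments (Parse, url, Inlinks).

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.nutch.crawl.CrawlDatum: just the metadata map.
class MetaDatum {
    Map<String, String> metaData = new HashMap<>();
}

// Stand-in for org.apache.nutch.indexer.NutchDocument.
class Doc {
    Map<String, String> fields = new HashMap<>();
    void add(String name, String value) { fields.put(name, value); }
}

class CustomMetaFilter {
    // Mirrors what my filter does: copy the custom metadata keys from
    // the CrawlDatum into the NutchDocument so Solr indexes them.
    // The "my." prefix is a made-up example, not a Nutch convention.
    static Doc filter(Doc doc, MetaDatum datum) {
        for (Map.Entry<String, String> e : datum.metaData.entrySet()) {
            if (e.getKey().startsWith("my.")) {
                doc.add(e.getKey(), e.getValue());
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        MetaDatum datum = new MetaDatum();
        datum.metaData.put("my.category", "news");
        datum.metaData.put("fetchTime", "123"); // not copied: no "my." prefix
        Doc doc = filter(new Doc(), datum);
        System.out.println(doc.fields);
    }
}
```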

To make debugging easier, I configured Nutch to crawl my local website, so
the crawled result is fully under my control. When I change some content in
the pages, the crawled metadata should change as well. But after indexing,
Solr did not receive the changed data.

After debugging, I can confirm that in CrawlDbReducer.java Nutch really
does produce the latest CrawlDatum (at the line "output.collect(key, result);"
the member "result" holds the latest data). I suppose the latest CrawlDatum
is then written to the CrawlDB. Is that right?

Then I traced the process further in IndexerMapReduce.java. In the "reduce"
method, the while loop assigns dbDatum, fetchDatum, parseData and parseText.
Now come my doubts:
1. How is it ensured that these four variables get the latest data? Or does
it not matter?
2. At line 273, why is fetchDatum rather than dbDatum passed
to this.filters.filter? It seems that dbDatum is the one stored in the
CrawlDB, and so the latest one.
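To make question 1 concrete, my reading of the reduce loop is that it simply
dispatches each incoming value by type (and, for a CrawlDatum, by status), so
the last value of each kind that arrives wins. Below is a simplified
self-contained sketch of that dispatch; Datum and the "db"/"fetch" kind
strings are stand-ins, not the real NutchWritable types or CrawlDatum status
checks.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for the values seen by IndexerMapReduce.reduce(): each value
// carries a "kind" tag instead of the real instanceof / status checks.
class Datum {
    String kind;
    String payload;
    Datum(String kind, String payload) { this.kind = kind; this.payload = payload; }
}

class ReduceDispatch {
    static String dbDatum, fetchDatum, parseData, parseText;

    // Simplified version of the while loop: each assignment overwrites the
    // previous one, so whichever copy of a given kind comes last "wins".
    static void dispatch(List<Datum> values) {
        dbDatum = fetchDatum = parseData = parseText = null;
        for (Datum v : values) {
            switch (v.kind) {
                case "db":        dbDatum   = v.payload; break;
                case "fetch":     fetchDatum = v.payload; break;
                case "parseData": parseData = v.payload; break;
                case "parseText": parseText = v.payload; break;
            }
        }
    }

    public static void main(String[] args) {
        dispatch(Arrays.asList(
                new Datum("db", "old-db"),
                new Datum("fetch", "latest-fetch"),
                new Datum("db", "latest-db")));
        // The later "db" value has overwritten the earlier one.
        System.out.println(dbDatum + " / " + fetchDatum);
    }
}
```

If that reading is correct, my question boils down to whether the job input
guarantees that the latest CrawlDatum is among the values at all.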

Thanks in advance.
Liaokz
