Hello all, I have written a plugin that implements the IndexingFilter interface. The plugin extracts some custom metadata fields from the CrawlDatum and adds them to the NutchDocument so that the fields get indexed by Solr.
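For context, the metadata-copying logic in my filter looks roughly like this. This is a minimal, self-contained sketch: `Doc` is a stand-in for Nutch's NutchDocument, the CrawlDatum metadata is simplified to a plain Map, and the field name `myCustomField` is just an example, not the real key.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.nutch.indexer.NutchDocument (simplified for illustration).
class Doc {
    final Map<String, String> fields = new HashMap<>();
    void add(String name, String value) { fields.put(name, value); }
}

public class MetadataFilterSketch {
    // Mirrors the shape of IndexingFilter.filter(): copy a custom metadata key
    // from the CrawlDatum's metadata into the document being sent to Solr.
    // "myCustomField" is a hypothetical key name.
    static Doc filter(Doc doc, Map<String, String> datumMeta) {
        String value = datumMeta.get("myCustomField");
        if (value != null) {
            doc.add("myCustomField", value);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put("myCustomField", "hello");
        Doc doc = filter(new Doc(), meta);
        System.out.println(doc.fields);
    }
}
```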
To make debugging easier, I configured Nutch to crawl my local website, so the crawled result is under my control. When I change some content in the pages, the crawled metadata should change as well. But after indexing, Solr did not get the changed data. After debugging, I can confirm that in CrawlDbReducer.java Nutch really does return the latest CrawlDatum (at the line "output.collect(key, result);" the variable "result" holds the latest data). I suppose the latest CrawlDatum is then written to the CrawlDB. Isn't that right?

Then I traced the process in IndexerMapReduce.java. In the reduce method, the while loop assigns dbDatum, fetchDatum, parseData, and parseText. Now come my doubts:

1. How is it guaranteed that these four variables get the latest data? Or does it not matter?
2. In line 273, why is fetchDatum rather than dbDatum passed to this.filters.filter? It seems that dbDatum is the one stored in the CrawlDB and is the latest one.

Thanks in advance.
Liaokz
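To make my first question concrete: the while loop I am asking about follows roughly this dispatch pattern. This is a minimal, self-contained sketch with stand-in types (`FetchDatum`, `DbDatum`, `ParseDataStub`, `ParseTextStub` replace the real CrawlDatum/ParseData/ParseText classes), not the actual Nutch code. The point is that each value arriving for a URL key is tested by type and assigned to a local variable, so if more than one value of the same kind arrives (e.g. from multiple segments), the last one read wins, and the iteration order is what I am unsure about.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in value types; the real ones in Nutch are CrawlDatum (with a fetch vs.
// db status), ParseData, and ParseText, wrapped in NutchWritable.
class FetchDatum {}
class DbDatum {}
class ParseDataStub {}
class ParseTextStub {}

public class ReduceDispatchSketch {
    // Sketch of the dispatch in IndexerMapReduce.reduce(): each incoming value
    // is matched by type and stored; a later value of the same kind overwrites
    // an earlier one. Returns {fetchDatum, dbDatum, parseData, parseText}.
    static Object[] dispatch(List<Object> values) {
        Object fetchDatum = null, dbDatum = null, parseData = null, parseText = null;
        for (Object v : values) {
            if (v instanceof FetchDatum) fetchDatum = v;
            else if (v instanceof DbDatum) dbDatum = v;
            else if (v instanceof ParseDataStub) parseData = v;
            else if (v instanceof ParseTextStub) parseText = v;
        }
        return new Object[] { fetchDatum, dbDatum, parseData, parseText };
    }

    public static void main(String[] args) {
        List<Object> values = Arrays.asList(
            new DbDatum(), new FetchDatum(), new ParseDataStub(), new ParseTextStub());
        Object[] r = dispatch(values);
        System.out.println(Arrays.toString(r));
    }
}
```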

