Hi Liaokz,

> After debugging, I could confirm that in CrawlDbReducer.java, Nutch really
> returns the latest CrawlDatum (at the line of "output.collect(key, result);"
> the member "result" has the latest data). I suppose the latest CrawlDatum
> is written to CrawlDB. Isn't it right?

No, or only partially:
- multiple CrawlDatums are merged to determine the new status, fetch time,
  etc. It is not that the last datum is simply written into CrawlDb. Of
  course, the newest CrawlDatum has some prominence because it contains the
  most recent status of the accessed document.
- this applies only if a dbStatus and a fetchStatus are merged (e.g., old
  status "fetched" + new status "linked" only changes the score of the old
  "fetched" CrawlDatum)
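
To make the merge idea concrete, here is a minimal toy sketch. It is not the Nutch code: the class, enum and method names are hypothetical, and the rules (summing scores for a "linked" datum, adding a fixed interval to get the next fetch time) are assumptions chosen only to illustrate that the db entry is a merged result rather than a copy of the newest datum.

```java
// Toy model only: all types here are hypothetical, NOT the Nutch classes.
class CrawlDbMergeSketch {

    enum Status { UNFETCHED, FETCHED, GONE, LINKED }

    static final class Datum {
        final Status status;
        final long fetchTime; // in a db entry: when the page is due for re-fetch
        final float score;

        Datum(Status status, long fetchTime, float score) {
            this.status = status;
            this.fetchTime = fetchTime;
            this.score = score;
        }
    }

    // Merge the existing db entry with one incoming datum.
    static Datum merge(Datum db, Datum incoming, long intervalMs) {
        if (incoming.status == Status.LINKED) {
            // A new inlink to a known page: status and fetch time are kept,
            // only the score changes (summing is an assumption of this sketch).
            return new Datum(db.status, db.fetchTime, db.score + incoming.score);
        }
        // A fetch result: take over the new status and schedule the next fetch
        // (a fixed interval is an assumption of this sketch).
        return new Datum(incoming.status, incoming.fetchTime + intervalMs, db.score);
    }

    public static void main(String[] args) {
        Datum db = new Datum(Status.FETCHED, 1_000L, 1.0f);

        Datum afterLink = merge(db, new Datum(Status.LINKED, 5_000L, 0.5f), 100L);
        System.out.println(afterLink.status + " " + afterLink.fetchTime + " " + afterLink.score);

        Datum afterFetch = merge(db, new Datum(Status.GONE, 2_000L, 0.0f), 100L);
        System.out.println(afterFetch.status + " " + afterFetch.fetchTime + " " + afterFetch.score);
    }
}
```

In this toy model the "linked" datum is newer than the db entry, yet the merged entry keeps the old "fetched" status and fetch time.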

What you see in CrawlDb is a mixture of
- the current status (unfetched, fetched, gone, notmodified, etc.)
- some hints on how the document should be re-crawled in the future:
  * the fetch time in CrawlDb is the time when the doc should be re-fetched
  * the status may be set to another one, e.g., to "unfetched" to force a
    re-fetch

This answers

> 2. in line 273, why pass fetchDatum rather than dbDatum

The latest fetchDatum is more relevant because it holds the state at the
time the document was fetched. The dbDatum may have changed its state
since then.

> 1. how it's sure that the four variables get the latest data? Or it does not
> matter?

It does not matter if you only index one segment (that's the way used by
bin/crawl). It matters if you index multiple segments at once. In that case
the ordering is respected because segments are passed to the indexer oldest
first, so the newest fetchDatum is used.

Bye,
Sebastian

On 06/17/2013 11:18 AM, liaokz wrote:
> Hello all
>
> I have written a plugin which implements the IndexingFilter interface. The
> plugin extracts some custom metadata fields from the CrawlDatum and adds
> them to NutchDocument so that the fields are indexed by Solr.
>
> To make debugging easy, I configured Nutch to crawl my local website, so the
> crawled result is within my control. When I changed some content in
> pages, the crawled metadata should also change. But when indexed, Solr
> did not get the changed data.
>
> After debugging, I could confirm that in CrawlDbReducer.java, Nutch really
> returns the latest CrawlDatum (at the line of "output.collect(key, result);"
> the member "result" has the latest data). I suppose the latest CrawlDatum
> is written to CrawlDB. Isn't it right?
>
> Then I traced the process in IndexerMapReduce.java. In method "reduce",
> the while loop assigns dbDatum, fetchDatum, parseData and parseText.
> Now comes my doubt:
> 1. how it's sure that the four variables get the latest data? Or it does not
> matter?
> 2. in line 273, why pass fetchDatum rather than dbDatum
> to this.filters.filter? It seems that the dbDatum is the one stored in
> CrawlDB and is the latest one.
>
> Thanks in advance.
> Liaokz
>

