Hello,

I'm trying to pass metadata from a document D1 via an outlink to a linked document D2 - the use case also mentioned in https://issues.apache.org/jira/browse/NUTCH-1622. In a custom IndexingFilter I take the metadata from the CrawlDatum and add it to the NutchDocument.

This works fine as long as the link from D1 is the first link to D2 encountered by Nutch. However, it fails if a different document D3 that also links to D2 is crawled before D1. In that case, the metadata on D1's outlink does not trigger Nutch to index D2 again.

Working scenario:
D3 -> D1 -> D2
Result: D2 gets indexed with metadata coming from D1

Failing scenario:
D3 -> D1 -> D2
D3 -> D2
Result: D2 gets indexed without metadata, but there is a CrawlDatum containing the metadata

The only way I have found to get the document indexed in the second scenario is to use a ScoringFilter and change the CrawlDatum "datum" in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) if there is an inlinked CrawlDatum containing the required metadata. After setting datum.setFetchTime(System.currentTimeMillis()), Nutch fetches the document again in the next iteration of the crawl script and it gets indexed successfully.
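
For illustration, the workaround can be sketched roughly as follows. This is a standalone sketch rather than actual Nutch code: Datum is a hypothetical stand-in for the parts of CrawlDatum used here, META_KEY is a made-up key name, and the url is a plain String instead of Text. Copying the value onto "datum" is an assumption about what changing the datum involves, and the current time is passed in as a parameter only so the sketch is easy to check.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RefetchOnInlinkMetadata {

    /** Hypothetical stand-in for CrawlDatum: only metadata and fetch time. */
    static final class Datum {
        final Map<String, String> metaData = new HashMap<>();
        long fetchTime;
    }

    /** Hypothetical metadata key carried on D1's outlink. */
    static final String META_KEY = "d1.meta";

    /**
     * Mirrors the shape of ScoringFilter.updateDbScore(url, old, datum, inlinked).
     * If any inlinked datum carries META_KEY, copy the value onto datum and
     * mark datum as due for fetching by setting its fetch time to "now".
     * Returns true if datum was changed.
     */
    static boolean updateDbScore(String url, Datum old, Datum datum,
                                 List<Datum> inlinked, long now) {
        for (Datum in : inlinked) {
            String value = in.metaData.get(META_KEY);
            if (value != null) {
                datum.metaData.put(META_KEY, value); // assumption: carry the value over
                datum.fetchTime = now;               // due now, so the next cycle refetches
                return true;
            }
        }
        return false; // no inlink carries the metadata: leave datum untouched
    }

    public static void main(String[] args) {
        Datum d2 = new Datum();
        d2.fetchTime = Long.MAX_VALUE; // pretend D2 is not due for a long time

        Datum linkFromD1 = new Datum();
        linkFromD1.metaData.put(META_KEY, "value-from-D1");

        boolean changed = updateDbScore("http://example.org/d2", null, d2,
                Collections.singletonList(linkFromD1), System.currentTimeMillis());
        System.out.println("changed=" + changed + " meta=" + d2.metaData.get(META_KEY));
    }
}
```

As written, the sketch resets the fetch time on every update in which such an inlink is present; a real filter would probably need a check that the metadata has not already been applied, otherwise D2 would be refetched in every cycle.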

Is this the intended approach for the above use case? Isn't it possible to index the document without fetching it again? Any comments would be appreciated.

Best regards,
Florian
