Hello,
I'm trying to pass metadata from a document D1 via an outlink to a
linked document D2 - the use case also mentioned in
https://issues.apache.org/jira/browse/NUTCH-1622. In a custom
IndexingFilter I take the metadata from the CrawlDatum and add it to the
NutchDocument.
This works fine as long as the link from D1 is the first link to D2
encountered by Nutch. However, it fails if a different document D3
linking to D2 is crawled before D1. In that case the metadata on D1's
outlink does not trigger Nutch to index D2 again.
Working scenario:
D3 -> D1 -> D2
Result: D2 gets indexed with metadata coming from D1
Failing scenario:
D3 -> D1 -> D2
D3 -> D2
Result: D2 gets indexed without metadata, but there is a CrawlDatum
containing the metadata
The only way I have found to get the document indexed with the metadata
in the second scenario is to use a ScoringFilter and change the
CrawlDatum "datum" in
updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
List<CrawlDatum> inlinked) whenever one of the inlinked CrawlDatum
objects contains the required metadata. After I set
datum.setFetchTime(System.currentTimeMillis()), Nutch fetches the
document again in the next iteration of the crawl script and it gets
indexed successfully.
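To make the workaround concrete, here is a minimal sketch of the decision
made inside updateDbScore. It deliberately does not use the real Nutch
classes: Datum is a simplified, hypothetical stand-in for CrawlDatum, and
the metadata key "d1.meta" is an invented name. Copying the value onto
the datum, and skipping the refetch when the value is already there, are
assumptions added so the page is not refetched in every round.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RefetchOnInlinkMetadata {

    /** Hypothetical stand-in for CrawlDatum: only the fields this sketch needs. */
    static final class Datum {
        final Map<String, String> metaData = new HashMap<>();
        long fetchTime;
    }

    /** Hypothetical metadata key; the real key is whatever D1's outlink carries. */
    static final String KEY = "d1.meta";

    /**
     * Mirrors the logic placed in updateDbScore(url, old, datum, inlinked):
     * if an inlinked datum carries the key and the page's datum does not
     * hold that value yet, copy it and make the page due for fetching now.
     * Returns true if the datum was changed.
     */
    static boolean propagate(Datum datum, List<Datum> inlinked, long now) {
        for (Datum in : inlinked) {
            String value = in.metaData.get(KEY);
            if (value != null && !value.equals(datum.metaData.get(KEY))) {
                datum.metaData.put(KEY, value); // assumption: keep the value on the datum
                datum.fetchTime = now;          // corresponds to datum.setFetchTime(...)
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Datum d2 = new Datum();
        d2.fetchTime = Long.MAX_VALUE;          // not due for refetch

        Datum fromD3 = new Datum();             // inlink without metadata
        Datum fromD1 = new Datum();             // inlink carrying the metadata
        fromD1.metaData.put(KEY, "value-from-D1");

        boolean changed = propagate(d2, Arrays.asList(fromD3, fromD1),
                System.currentTimeMillis());
        System.out.println("refetch scheduled: " + changed);
    }
}
```

In the real filter the same check would run against
CrawlDatum.getMetaData() for each entry in the inlinked list.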
Is this the intended approach for the above use case? Isn't it possible
to index the document without fetching it again? Any comments would be
appreciated.
Best regards,
Florian