Outlink with metadata

Florian Schmedding Thu, 01 May 2014 22:54:07 -0700

Hello,

I'm trying to pass metadata from a document D1 via an outlink to a linked document D2 - the use case also mentioned in https://issues.apache.org/jira/browse/NUTCH-1622. In a custom IndexingFilter I take the metadata from the CrawlDatum and add it to the NutchDocument.
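
To make the filter idea concrete, here is a minimal sketch. It deliberately uses simplified stand-in types (Datum, Doc) instead of the real Nutch classes so that it runs without a Nutch installation; in Nutch the corresponding hook would be the filter method of an IndexingFilter, which receives the NutchDocument and the CrawlDatum. The metadata key "outlink.meta" is a hypothetical name chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the indexing step; these are NOT the real Nutch classes.
class IndexingFilterSketch {

    /** Stand-in for CrawlDatum: carries only a metadata map. */
    static class Datum {
        final Map<String, String> metaData = new HashMap<>();
    }

    /** Stand-in for NutchDocument: field name mapped to its values. */
    static class Doc {
        final Map<String, List<String>> fields = new HashMap<>();

        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /** Hypothetical key under which D1's outlink metadata is stored. */
    static final String KEY = "outlink.meta";

    /** Copies the outlink metadata from the datum into the document, if present. */
    static Doc filter(Doc doc, Datum datum) {
        String value = datum.metaData.get(KEY);
        if (value != null) {
            doc.add(KEY, value);
        }
        return doc;
    }
}
```

If the datum passed to the filter does not carry the key, the document is returned unchanged, which matches the failing scenario described below.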

This works fine as long as the link from D1 is the first link to D2 encountered by Nutch. However, it fails if a different document D3 linking to D2 is crawled before D1. The metadata of D1's outlink do not trigger Nutch to index D2 again in this case.


Working scenario:
D3 -> D1 -> D2
Result: D2 gets indexed with metadata coming from D1

Failing scenario:
D3 -> D1 -> D2
D3 -> D2

Result: D2 gets indexed without metadata, but there is a CrawlDatum containing the metadata

The only way I can get the document indexed in the second scenario is to use a ScoringFilter and change the CrawlDatum "datum" in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) if there is an inlinked CrawlDatum containing the required metadata. After setting datum.setFetchTime(System.currentTimeMillis()), Nutch fetches the document again in the next iteration of the crawl script and it gets indexed successfully.
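
The workaround can be sketched roughly as follows. Again this uses a simplified stand-in type (Datum) rather than the real CrawlDatum, and a plain static method that only mirrors the shape of updateDbScore. The key name "outlink.meta", the copying of the value into the target datum, and the check that avoids re-triggering a fetch once the value is already present are assumptions added for illustration. The current time is passed in as a parameter so the logic can be tested; the real filter would use System.currentTimeMillis().

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the ScoringFilter workaround; NOT the real Nutch classes.
class RefetchOnMetadataSketch {

    /** Stand-in for CrawlDatum: a fetch time plus a metadata map. */
    static class Datum {
        long fetchTime;
        final Map<String, String> metaData = new HashMap<>();

        Datum(long fetchTime) {
            this.fetchTime = fetchTime;
        }
    }

    /** Hypothetical key under which D1's outlink metadata is stored. */
    static final String KEY = "outlink.meta";

    /**
     * Mirrors updateDbScore(url, old, datum, inlinked): if any inlinked datum
     * carries the metadata and the target datum does not have it yet, copy it
     * and make the target due for fetching now. Returns true if a refetch was forced.
     */
    static boolean updateDbScore(String url, Datum old, Datum datum,
                                 List<Datum> inlinked, long now) {
        if (datum.metaData.containsKey(KEY)) {
            return false; // assumption: do not force another fetch once the value is stored
        }
        for (Datum in : inlinked) {
            String value = in.metaData.get(KEY);
            if (value != null) {
                datum.metaData.put(KEY, value);
                datum.fetchTime = now; // corresponds to datum.setFetchTime(System.currentTimeMillis())
                return true;
            }
        }
        return false;
    }
}
```

In the failing scenario, the inlinked list for D2 would contain one entry from D3 without the key and one from D1 with it, so the method marks D2 as due for fetching in the next crawl iteration.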

Is this the intended approach for the above use case? Isn't it possible to index the document without fetching it again? Any comments would be appreciated.


Best regards,
Florian
