Hi Florian,

Interesting, thanks for explaining it so clearly. NUTCH-1622<https://issues.apache.org/jira/browse/NUTCH-1622> can't help for cases like these, and the approach you used is the most straightforward way of doing it. It means re-fetching, which is a bit of a pain though.
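The approach described in Florian's quoted mail below boils down to: when an inlinked CrawlDatum carries the metadata and the db entry doesn't, copy it over and reset the fetch time so the next generate/fetch cycle re-fetches the page. A minimal self-contained sketch of that decision logic, using simplified stand-in classes rather than the real Nutch CrawlDatum/ScoringFilter types:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for Nutch's CrawlDatum (NOT the real class): just the two pieces
// of state the workaround touches, the fetch time and the metadata map.
class Datum {
    long fetchTime;
    Map<String, String> meta = new HashMap<>();
}

public class UpdateDbScoreSketch {
    // Hypothetical metadata key carried along the outlink from D1.
    static final String KEY = "outlink.meta";

    // In Nutch this logic would live in a custom ScoringFilter's
    // updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
    //               List<CrawlDatum> inlinked), as described in the thread.
    static void updateDbScore(Datum datum, List<Datum> inlinked) {
        for (Datum in : inlinked) {
            if (in.meta.containsKey(KEY) && !datum.meta.containsKey(KEY)) {
                // copy the metadata from the inlink onto the db entry ...
                datum.meta.put(KEY, in.meta.get(KEY));
                // ... and reset the fetch time so the next iteration of the
                // crawl script re-fetches (and so re-indexes) the page
                datum.fetchTime = System.currentTimeMillis();
            }
        }
    }
}
```

The cost, as noted above, is the extra fetch: the document is downloaded again just to get its new crawldb metadata into the index.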
An alternative would be to allow the re-indexing of documents even if only
their crawldb metadata has changed. The class IndexerMapReduce skips
documents without a fetchDatum (i.e. nothing for this document in the
segment), which I believe would be the case in your failing scenario when
you discover D2 after the second iteration. Now what we'd also need is a
mechanism for updating a document, as opposed to indexing it. I'm not sure
this is currently supported in our indexing framework, and it wouldn't work
with all indexing backends: Elasticsearch allows updates, but I am not sure
SOLR does, for instance. Basically I can't think of another way to do it at
the moment, but someone else might be more creative than me.

Thanks,

Julien

On 2 May 2014 06:53, Florian Schmedding <[email protected]> wrote:

> Hello,
>
> I'm trying to pass metadata from a document D1 via an outlink to a linked
> document D2 - the use case also mentioned in
> https://issues.apache.org/jira/browse/NUTCH-1622. In a custom
> IndexingFilter I take the metadata from the CrawlDatum and add it to the
> NutchDocument.
>
> This works fine as long as the link from D1 is the first link to D2
> encountered by Nutch. However, it fails if a different document D3 linking
> to D2 is crawled before D1. The metadata of D1's outlink does not trigger
> Nutch to index D2 again in this case.
>
> Working scenario:
> D3 -> D1 -> D2
> Result: D2 gets indexed with metadata coming from D1
>
> Failing scenario:
> D3 -> D1 -> D2
> D3 -> D2
> Result: D2 gets indexed without metadata, but there is a CrawlDatum
> containing the metadata
>
> The only way I get the document indexed in the second scenario is using a
> ScoringFilter and changing the CrawlDatum "datum" in updateDbScore(Text
> url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) if there
> is an inlinked CrawlDatum containing the required metadata.
> After setting datum.setFetchTime(System.currentTimeMillis()), Nutch
> fetches the document again in the next iteration of the crawl script and
> it gets indexed successfully.
>
> Is this the intended approach for the above use case? Isn't it possible to
> index the document without fetching it again? Any comments would be
> appreciated.
>
> Best regards,
> Florian

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
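To make the skip mentioned in the reply above concrete: the indexing job only emits a document when the segment being indexed actually contains a fetch of it, so a change that touches nothing but the crawldb metadata never reaches the indexer. A toy model of that rule (not actual Nutch code):

```java
import java.util.Set;

// Toy model of the behaviour discussed above, NOT real Nutch code: the
// indexing job sees one segment at a time, and a URL whose crawldb entry
// changed but which was not fetched in that segment is simply skipped.
public class IndexerSkipSketch {
    static boolean indexedFromSegment(String url, Set<String> fetchedInSegment) {
        // mirrors IndexerMapReduce dropping documents with no fetchDatum
        return fetchedInSegment.contains(url);
    }
}
```

In the failing scenario, D2 is fetched (and indexed, still without metadata) in an early iteration; when D1's outlink later adds metadata to D2's crawldb entry, that iteration's segment holds no fetch of D2, so D2 is skipped.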

