Hi Florian

Interesting. Thanks for explaining it so clearly.
NUTCH-1622 <https://issues.apache.org/jira/browse/NUTCH-1622> can't
help in cases like these, and the approach you used is the most
straightforward way of doing it. It does mean re-fetching, though, which
is a bit of a pain.

An alternative would be to allow re-indexing of documents even if only
their CrawlDb metadata has changed. The class IndexerMapReduce skips
documents without a fetchDatum (i.e. nothing for this document in the
segment), which I believe would be the case in your failing scenario when
you discover D2 after the second iteration. We'd also need a mechanism for
updating a document, as opposed to indexing it. I'm not sure this is
currently supported in our indexing framework, and it wouldn't work with
all indexing backends: ElasticSearch allows updates, but I am not sure
SOLR does, for instance.
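
For what it's worth, an ElasticSearch partial update takes just the changed
fields, along these lines (the index, type, document id and field name here
are placeholders, and the exact endpoint form may differ between versions):

```
POST /nutch/doc/<docid>/_update
{
  "doc": { "outlink.meta": "value-from-D1" }
}
```

So the backend side is feasible there; it's the indexing framework that has
no notion of "update" today.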

Basically I can't think of another way to do it at the moment, but someone
else might be more creative than me.
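
For reference, the core of the ScoringFilter workaround you describe below
can be sketched in a self-contained way like this (Nutch's CrawlDatum is
stood in by a plain Map so the sketch runs on its own, and "outlink.meta"
is a hypothetical metadata key; adapt both to your plugin):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the updateDbScore() logic: if any inlinked datum carries the
// metadata key, copy it onto the target datum and reset its fetch time so
// the next crawl iteration re-fetches (and thus re-indexes) the document.
public class MetadataRefetchSketch {

    static final String META_KEY = "outlink.meta"; // hypothetical key

    // Returns true if the datum was modified and should be re-fetched.
    static boolean mergeInlinkMetadata(Map<String, String> datum,
                                       List<Map<String, String>> inlinked) {
        for (Map<String, String> in : inlinked) {
            String value = in.get(META_KEY);
            if (value != null && !value.equals(datum.get(META_KEY))) {
                datum.put(META_KEY, value);
                // In the real filter this would be
                // datum.setFetchTime(System.currentTimeMillis());
                datum.put("fetchTime",
                          Long.toString(System.currentTimeMillis()));
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> d2 = new HashMap<>();
        Map<String, String> fromD1 = new HashMap<>();
        fromD1.put(META_KEY, "value-from-D1");
        boolean changed = mergeInlinkMetadata(d2, List.of(fromD1));
        System.out.println(changed + " " + d2.get(META_KEY));
        // prints: true value-from-D1
    }
}
```

The guard against re-copying an unchanged value matters, otherwise the
fetch time would be reset on every updatedb and the page re-fetched forever.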

Thanks

Julien

On 2 May 2014 06:53, Florian Schmedding <[email protected]> wrote:

> Hello,
>
> I'm trying to pass metadata from a document D1 via an outlink to a linked
> document D2 - the use case also mentioned in
> https://issues.apache.org/jira/browse/NUTCH-1622. In a custom
> IndexingFilter I take the metadata
> from the CrawlDatum and add it to the NutchDocument.
>
> This works fine as long as the link from D1 is the first link to D2
> encountered by Nutch. However, it fails if a different document D3 linking
> to D2 is crawled before D1. The metadata of D1's outlink does not trigger
> Nutch to index D2 again in this case.
>
> Working scenario:
> D3 -> D1 -> D2
> Result: D2 gets indexed with metadata coming from D1
>
> Failing scenario:
> D3 -> D1 -> D2
> D3 -> D2
> Result: D2 gets indexed without metadata, but there is a CrawlDatum
> containing the metadata
>
> The only way I get the document indexed in the second scenario is using a
> ScoringFilter and changing the CrawlDatum "datum" in updateDbScore(Text
> url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) if there
> is an inlinked CrawlDatum containing the required metadata. After setting
> datum.setFetchTime(System.currentTimeMillis()) Nutch fetches the document
> again in the next iteration of the crawl script and it gets indexed
> successfully.
>
> Is this the intended approach for the above use case? Isn't it possible to
> index the document without fetching it again? Any comments would be
> appreciated.
>
> Best regards,
> Florian
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
