Hi Julien
Thanks for your support. I finished the crawler with the re-fetching
solution. Nevertheless, I'd like to avoid the unnecessary fetch and have
looked into the IndexerMapReduce class. It looks like a customized
IndexingJob could call a different implementation of it. However, I'm
not sure I understand correctly what you mean by updates. Would a
custom IndexerMapReduce then only get the new metadata from the
CrawlDatum, or would it still have access to the previously fetched
document data, too (but without manually reading segments)? The first
case would not be very useful for the current crawler.
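To make the question more concrete, here is roughly what I have in mind
for the custom reduce side (just a sketch, not working code; key, dbDatum,
fetchDatum and output are the usual local variables in
IndexerMapReduce.reduce(), "link.metadata" stands in for my actual
metadata key, and it assumes an update-style action that the indexing
framework and the backend would have to support):

// Hypothetical fragment of a custom IndexerMapReduce.reduce():
// if there is no fetchDatum but the CrawlDatum carries the inlink
// metadata, emit a partial document instead of skipping the key.
if (fetchDatum == null && dbDatum != null
    && dbDatum.getMetaData().containsKey(new Text("link.metadata"))) {
  NutchDocument doc = new NutchDocument();
  doc.add("id", key.toString());
  doc.add("link.metadata",
      dbDatum.getMetaData().get(new Text("link.metadata")).toString());
  // NutchIndexAction.UPDATE is assumed here; the writer would need to
  // translate it into a partial update on the backend.
  output.collect(key, new NutchIndexAction(doc, NutchIndexAction.UPDATE));
  return;
}

In such a fragment only the CrawlDatum metadata is available, which is
what prompts my question about the previously fetched document data.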
Best regards,
Florian
On 02.05.2014 10:51, Julien Nioche wrote:
Hi Florian
Interesting. Thanks for explaining it so clearly.
NUTCH-1622 <https://issues.apache.org/jira/browse/NUTCH-1622> can't
help for cases like these, and the approach you used is the most
straightforward way of doing it. It means re-fetching, which is a bit of a
pain though.
An alternative would be to allow the re-indexing of documents even if only
their crawldb metadata has changed. The class IndexerMapReduce skips
documents without a fetchDatum (i.e. nothing for this document in the
segment), which I believe would be the case in your failing scenario when
you discover D2 after the second iteration. Now what we'd also need is a
mechanism for updating a document, as opposed to indexing it. Not sure this
is supported currently in our indexing framework, and it wouldn't work with
all indexing backends: ElasticSearch allows updates, but I am not sure Solr
does, for instance.
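For reference, the guard in IndexerMapReduce.reduce() looks roughly like
this (paraphrased from memory, so the exact code may differ between
versions):

// Keys for which the segment holds no fetched/parsed data - only
// inlinked CrawlDatums - are skipped entirely at this point.
if (fetchDatum == null || dbDatum == null
    || parseText == null || parseData == null) {
  return; // unfetched page, nothing to index
}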
Basically I can't think of another way to do it at the moment, but someone
else might be more creative than me.
Thanks
Julien
On 2 May 2014 06:53, Florian Schmedding <[email protected]> wrote:
Hello,
I'm trying to pass metadata from a document D1 via an outlink to a linked
document D2 - the use case also mentioned in
https://issues.apache.org/jira/browse/NUTCH-1622. In a custom
IndexingFilter I take the metadata
from the CrawlDatum and add it to the NutchDocument.
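Simplified, the filter looks like this ("link.metadata" is a placeholder
for the key I actually use):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class LinkMetadataIndexingFilter implements IndexingFilter {

  private Configuration conf;

  // Copy the metadata that updatedb attached to the CrawlDatum into
  // the document that is sent to the index.
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    Writable meta = datum.getMetaData().get(new Text("link.metadata"));
    if (meta != null) {
      doc.add("link.metadata", meta.toString());
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}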
This works fine as long as the link from D1 is the first link to D2
encountered by Nutch. However, it fails if a different document D3 linking
to D2 is crawled before D1. The metadata of D1's outlink does not trigger
Nutch to index D2 again in this case.
Working scenario:
D3 -> D1 -> D2
Result: D2 gets indexed with metadata coming from D1
Failing scenario:
D3 -> D1 -> D2
D3 -> D2
Result: D2 gets indexed without metadata, but there is a CrawlDatum
containing the metadata
The only way I get the document indexed in the second scenario is using a
ScoringFilter and changing the CrawlDatum "datum" in updateDbScore(Text
url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) if there
is an inlinked CrawlDatum containing the required metadata. After calling
datum.setFetchTime(System.currentTimeMillis()), Nutch fetches the document
again in the next iteration of the crawl script and it gets indexed
successfully.
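In code, the workaround amounts to the following (only the relevant method
is shown; the other ScoringFilter methods are no-ops in my implementation,
and "link.metadata" is again a placeholder for the real key):

public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
    List<CrawlDatum> inlinked) throws ScoringFilterException {
  for (CrawlDatum inlink : inlinked) {
    Writable meta = inlink.getMetaData().get(new Text("link.metadata"));
    if (meta != null) {
      // Carry the metadata over to the linked page's CrawlDatum...
      datum.getMetaData().put(new Text("link.metadata"), meta);
      // ...and make the page due for fetching so that the next crawl
      // iteration re-fetches and re-indexes it.
      datum.setFetchTime(System.currentTimeMillis());
    }
  }
}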
Is this the intended approach for the above use case? Isn't it possible to
index the document without fetching it again? Any comments would be
appreciated.
Best regards,
Florian
--
Dr.-Ing. Florian Schmedding
Averbis GmbH
Tennenbacher Strasse 11
D-79106 Freiburg
Phone: +49 (0) 761 - 203 67777
Fax: +49 (0) 761 - 203 97694
E-Mail: [email protected]
Web: www.averbis.de
Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
Registered office: Freiburg i. Br.
AG Freiburg i. Br., HRB 701080