Hi Julien
Thanks again. I've created a plugin for managing the metadata for an outlink. It uses the re-fetching approach to make sure that the metadata gets indexed. The plugin is available from https://github.com/florianschmedding/outlinkmeta.
I think partial updates cannot solve the problem in my scenario because
documents without certain metadata should not be indexed (there is an
indexing filter for this task).
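For illustration, a minimal sketch of what such a filter can look like, assuming the Nutch 1.x IndexingFilter extension point and a made-up metadata key "outlink.meta"; returning null from filter() is what makes the indexer drop the document:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Sketch only: skip documents that do not carry the required metadata.
public class RequireMetadataFilter implements IndexingFilter {

  // Hypothetical key under which the outlink metadata is stored.
  private static final Text META_KEY = new Text("outlink.meta");

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    if (datum.getMetaData().get(META_KEY) == null) {
      return null; // no metadata -> document is not indexed
    }
    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}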
I have already looked into Nutch 2 and it would certainly be useful.
Unfortunately I'm still using a content extraction plugin that works
only with Nutch 1.
Perhaps a better approach for my scenario would be to generate only the necessary outlinks via special parse filters instead of just using a regex URL filter to direct the crawler. This way the respective documents would only be fetched when the metadata is available.
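A rough sketch of such a parse filter, assuming the HtmlParseFilter extension point; shouldFollow() and the placeholder pattern stand in for whatever check decides that the metadata for a target URL is available:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Sketch only: keep just the outlinks that should actually be crawled.
public class OutlinkPruningParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Text key = new Text(content.getUrl());
    Parse parse = parseResult.get(key);
    ParseData data = parse.getData();

    List<Outlink> kept = new ArrayList<Outlink>();
    for (Outlink link : data.getOutlinks()) {
      if (shouldFollow(link.getToUrl())) {
        kept.add(link);
      }
    }

    // Replace the entry with a ParseData copy that carries only the
    // pruned outlink list.
    ParseData pruned = new ParseData(data.getStatus(), data.getTitle(),
        kept.toArray(new Outlink[kept.size()]), data.getContentMeta(),
        data.getParseMeta());
    parseResult.put(key, new ParseText(parse.getText()), pruned);
    return parseResult;
  }

  // Hypothetical check: only follow links whose target will have metadata.
  private boolean shouldFollow(String toUrl) {
    return toUrl.contains("/detail/"); // placeholder pattern
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}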
Best regards,
Florian
On 04.05.2014 20:51, Julien Nioche wrote:
Hi Florian
Just a few quick thoughts on this:
Thanks for your support. I finished the crawler with the re-fetching
solution. Despite that, I'd like to avoid the unnecessary fetch and have
looked into the IndexerMapReduce class. It looks like a customized
IndexingJob could call a different implementation of it. However, I'm not
sure if I understand correctly what you mean by updates.
What I mean by update is that an existing document would get a partial
update of its fields instead of being fully re-indexed.
Would a custom IndexerMapReduce thus only get the new metadata from the
CrawlDatum or would it still have access to the previously fetched document
data, too (but without manually reading segments)?
The idea is that the fields already sent to the indexer would remain
unchanged but we'd send the new metadata as an update. Not clear at this
stage how that could be implemented though.
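Purely to illustrate the idea (this is not something the current indexing framework does): with a backend that supports partial updates, the indexer writer could send just the new field instead of a full document. A sketch against Elasticsearch's _update endpoint, with index, type, id and field names all made up:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PartialUpdateSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical index "nutch", type "doc", document id "some-doc-id".
    URL endpoint = new URL("http://localhost:9200/nutch/doc/some-doc-id/_update");
    // Only the new metadata field is sent; existing fields stay untouched.
    String body = "{\"doc\":{\"outlink_meta\":\"value-from-D1\"}}";

    HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/json");
    OutputStream out = conn.getOutputStream();
    out.write(body.getBytes("UTF-8"));
    out.close();

    System.out.println("HTTP " + conn.getResponseCode());
  }
}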
In your case, wouldn't it be simpler to index just once at the end of the
crawl instead of per segment? Wouldn't you get the custom metadata if it
has been found - regardless of where it has been found?
Maybe 2.x would be a better match for your use case: there aren't any segments any more, and all the information (binary content/text, metadata) is available at any time. If you discover that a document you had previously indexed should have received some metadata, you could force it to be reindexed the next time. At least you would not need to refetch it.
Julien
The first case would not be very useful for the current crawler.
Best regards,
Florian
On 02.05.2014 10:51, Julien Nioche wrote:
Hi Florian
Interesting. Thanks for explaining it so clearly.
NUTCH-1622 (https://issues.apache.org/jira/browse/NUTCH-1622) can't help for cases like these, and the approach you used is the most straightforward way of doing it. It means re-fetching, which is a bit of a pain though.
An alternative would be to allow the re-indexing of documents even if only their crawldb metadata has changed. The class IndexerMapReduce skips documents without a fetchDatum (i.e. nothing for this document in the segment), which I believe would be the case in your failing scenario when you discover D2 after the second iteration. Now what we'd also need is a mechanism for updating a document, as opposed to indexing it. I'm not sure this is currently supported in our indexing framework, and it wouldn't work with all indexing backends: ElasticSearch allows updates, but I am not sure SOLR does, for instance.
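For reference, roughly the condition in question, paraphrased rather than quoted from IndexerMapReduce: a URL is only turned into a NutchDocument when both the crawldb entry and the segment data are present, so an entry whose crawldb metadata changed but that was not re-fetched never reaches the backend:

import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseText;

// Paraphrased sketch, not the actual Nutch source.
public class IndexEligibility {
  public static boolean wouldBeIndexed(CrawlDatum dbDatum, CrawlDatum fetchDatum,
      ParseData parseData, ParseText parseText) {
    // No fetch datum / parse in the current segment -> the reducer returns
    // early and nothing is (re)indexed for this URL.
    return dbDatum != null && fetchDatum != null
        && parseData != null && parseText != null;
  }
}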
Basically I can't think of another way to do it at the moment, but someone else might be more creative than me.
Thanks
Julien
On 2 May 2014 06:53, Florian Schmedding <[email protected]> wrote:
Hello,
I'm trying to pass metadata from a document D1 via an outlink to a linked document D2 - the use case also mentioned in https://issues.apache.org/jira/browse/NUTCH-1622. In a custom IndexingFilter I take the metadata from the CrawlDatum and add it to the NutchDocument.
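A sketch of that kind of filter method (made-up key and field names, not the exact plugin code); the surrounding IndexingFilter boilerplate (setConf/getConf) is the usual scaffolding:

// Excerpt: copy metadata attached to the CrawlDatum into the NutchDocument.
public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    CrawlDatum datum, Inlinks inlinks) throws IndexingException {
  org.apache.hadoop.io.Writable value =
      datum.getMetaData().get(new Text("outlink.meta")); // hypothetical key
  if (value != null) {
    doc.add("outlinkmeta", value.toString()); // hypothetical field name
  }
  return doc;
}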
This works fine as long as the link from D1 is the first link to D2 encountered by Nutch. However, it fails if a different document D3 linking to D2 is crawled before D1. In that case the metadata of D1's outlink does not trigger Nutch to index D2 again.
Working scenario:
D3 -> D1 -> D2
Result: D2 gets indexed with metadata coming from D1
Failing scenario:
D3 -> D1 -> D2
D3 -> D2
Result: D2 gets indexed without metadata, but there is a CrawlDatum containing the metadata
The only way I get the document indexed in the second scenario is using a ScoringFilter and changing the CrawlDatum "datum" in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) if there is an inlinked CrawlDatum containing the required metadata. After setting datum.setFetchTime(System.currentTimeMillis()), Nutch fetches the document again in the next iteration of the crawl script and it gets indexed successfully.
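For completeness, a sketch of the updateDbScore() part of that workaround; only this method is shown, the rest of the ScoringFilter plugin is omitted, and "outlink.meta" is a made-up metadata key:

// Excerpt from a ScoringFilter implementation; only this method matters here.
public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
    List<CrawlDatum> inlinked) throws ScoringFilterException {
  Text metaKey = new Text("outlink.meta"); // hypothetical metadata key
  for (CrawlDatum inlink : inlinked) {
    org.apache.hadoop.io.Writable value = inlink.getMetaData().get(metaKey);
    if (value != null && datum.getMetaData().get(metaKey) == null) {
      // Keep the metadata on the target's CrawlDatum and reset the fetch
      // time so the page is fetched - and therefore indexed - again in the
      // next cycle.
      datum.getMetaData().put(metaKey, value);
      datum.setFetchTime(System.currentTimeMillis());
      break;
    }
  }
}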
Is this the intended approach for the above use case? Isn't it possible to index the document without fetching it again? Any comments would be appreciated.
Best regards,
Florian