Hi Florian

Just a few quick thoughts on this:


> Thanks for your support. I finished the crawler with the re-fetching
> solution. Despite that, I'd like to avoid the unnecessary fetch and have
> looked into the IndexerMapReduce class. It looks like a customized
> IndexingJob could call a different implementation of it. However, I'm not
> sure if I understand correctly what you mean by updates.


What I mean by an update is that an existing document would get a partial
update of its fields instead of being fully re-indexed.
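
To make that concrete, here's a minimal sketch of what such a partial
update could look like against Elasticsearch with its Java client (index,
type and field names are made up for illustration):

    import org.elasticsearch.action.update.UpdateRequest;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.xcontent.XContentFactory;

    public class MetadataUpdater {
      // Sketch only: send the new metadata field as a partial update,
      // leaving the fields indexed earlier untouched.
      public static void updateMetadata(Client client, String docId,
          String value) throws Exception {
        UpdateRequest update = new UpdateRequest("nutch", "doc", docId)
            .doc(XContentFactory.jsonBuilder()
                .startObject()
                .field("custom_metadata", value) // illustrative field name
                .endObject());
        client.update(update).actionGet();
      }
    }

The same idea would need a different implementation per backend, which is
part of why it isn't a general solution.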


> Would a custom IndexerMapReduce thus only get the new metadata from the
> CrawlDatum or would it still have access to the previously fetched document
> data, too (but without manually reading segments)?


The idea is that the fields already sent to the indexer would remain
unchanged, but we'd send the new metadata as an update. It's not clear at
this stage how that could be implemented, though.
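
One rough direction, purely as a sketch: in a customized
IndexerMapReduce.reduce(), where the stock code skips documents without a
fetchDatum, emit an action built from the CrawlDatum metadata alone. The
UPDATE action below is hypothetical (NutchIndexAction currently only knows
ADD and DELETE) and the metadata key is made up:

    // Fragment inside a customized IndexerMapReduce.reduce(), replacing
    // the early return for documents that have no entry in the segment.
    if (fetchDatum == null && dbDatum != null
        && dbDatum.getMetaData().containsKey(new Text("custom.metadata"))) {
      NutchDocument doc = new NutchDocument();
      doc.add("id", key.toString());
      doc.add("custom_metadata",
          dbDatum.getMetaData().get(new Text("custom.metadata")).toString());
      // UPDATE does not exist yet; NutchIndexAction only has ADD and DELETE
      output.collect(key, new NutchIndexAction(doc, NutchIndexAction.UPDATE));
      return;
    }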

In your case, wouldn't it be simpler to index just once at the end of the
crawl instead of per segment? Wouldn't you then get the custom metadata
whenever it has been found, regardless of which segment it was found in?
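
With the 1.x IndexingJob that would be a single run over all segments once
the generate/fetch/parse/updatedb loop has finished, something along these
lines (paths assumed):

    bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments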

Maybe 2.x would be a better match for your use case: there are no segments
anymore, and all the information (binary content, text, metadata) is
available at any time. If you discover that a document you had previously
indexed should have received some metadata, you could force it to be
reindexed the next time around. At least you would not need to re-fetch it.
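
Very roughly, with everything in one table you could imagine a small job
along these lines (class names from the 2.x storage API; which mark the
2.x IndexingJob actually keys on would need checking against your version,
so treat this purely as a sketch):

    import org.apache.gora.store.DataStore;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.storage.Mark;
    import org.apache.nutch.storage.StorageUtils;
    import org.apache.nutch.storage.WebPage;
    import org.apache.nutch.util.NutchConfiguration;
    import org.apache.nutch.util.TableUtil;

    public class ForceReindex {
      // Sketch: flag an already-fetched row so a later IndexingJob
      // considers it again, without re-fetching the content.
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        DataStore<String, WebPage> store =
            StorageUtils.createWebStore(conf, String.class, WebPage.class);
        // URL is illustrative; rows are keyed by reversed URL
        String key = TableUtil.reverseUrl("http://example.com/d2");
        WebPage page = store.get(key);
        if (page != null) {
          // Assumption: clearing the index mark makes the row eligible again
          Mark.INDEX_MARK.removeMark(page);
          store.put(key, page);
          store.flush();
        }
        store.close();
      }
    }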

Julien

> The first case would not be very useful for the current crawler.
>
> Best regards,
> Florian
>
>
> On 02.05.2014 10:51, Julien Nioche wrote:
>
>> Hi Florian
>>
>> Interesting. Thanks for explaining it so clearly.
>> NUTCH-1622 <https://issues.apache.org/jira/browse/NUTCH-1622> can't
>> help for cases like these, and the approach you used is the most
>> straightforward way of doing it. It means re-fetching, which is a bit
>> of a pain though.
>>
>> An alternative would be to allow the re-indexing of documents even if
>> only their crawldb metadata has changed. The class IndexerMapReduce
>> skips documents without a fetchDatum (i.e. nothing for this document
>> in the segment), which I believe would be the case in your failing
>> scenario when you discover D2 after the second iteration. Now what
>> we'd also need is a mechanism for updating a document, as opposed to
>> indexing it. I'm not sure this is supported currently in our indexing
>> framework, and it wouldn't work with all indexing backends:
>> Elasticsearch allows updates but I am not sure Solr does, for instance.
>>
>> Basically I can't think of another way to do it at the moment, but
>> someone else might be more creative than me.
>>
>> Thanks
>>
>> Julien
>>
>> On 2 May 2014 06:53, Florian Schmedding <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I'm trying to pass metadata from a document D1 via an outlink to a
>>> linked document D2, the use case also mentioned in
>>> https://issues.apache.org/jira/browse/NUTCH-1622. In a custom
>>> IndexingFilter I take the metadata from the CrawlDatum and add it to
>>> the NutchDocument.
>>>
>>> This works fine as long as the link from D1 is the first link to D2
>>> encountered by Nutch. However, it fails if a different document D3
>>> linking to D2 is crawled before D1. The metadata from D1's outlink
>>> does not trigger Nutch to index D2 again in this case.
>>>
>>> Working scenario:
>>> D3 -> D1 -> D2
>>> Result: D2 gets indexed with metadata coming from D1
>>>
>>> Failing scenario:
>>> D3 -> D1 -> D2
>>> D3 -> D2
>>> Result: D2 gets indexed without metadata, but there is a CrawlDatum
>>> containing the metadata
>>>
>>> The only way I get the document indexed in the second scenario is by
>>> using a ScoringFilter and changing the CrawlDatum "datum" in
>>> updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
>>> List<CrawlDatum> inlinked) if there is an inlinked CrawlDatum
>>> containing the required metadata. After calling
>>> datum.setFetchTime(System.currentTimeMillis()), Nutch fetches the
>>> document again in the next iteration of the crawl script and it gets
>>> indexed successfully.
>>>
>>> Is this the intended approach for the above use case? Isn't it
>>> possible to index the document without fetching it again? Any
>>> comments would be appreciated.
>>>
>>> Best regards,
>>> Florian
>>>
> --
> Dr.-Ing. Florian Schmedding
> Averbis GmbH
> Tennenbacher Strasse 11
> D-79106 Freiburg
>
> Phone: +49 (0) 761 - 203 67777
> Fax: +49 (0) 761 - 203 97694
> E-Mail: [email protected]
> Web: www.averbis.de
>
> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
> Registered office: Freiburg i. Br.
> District Court Freiburg i. Br., HRB 701080
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
