Hugo Pinto wrote:
Hello,
I am using Nutch for mirroring, rather than crawling and indexing.
I need direct access to the cached data in my Nutch index, but I have
not found an easy way to get it.
I browsed the documentation (wiki and javadocs) and skimmed the code, but
found no straightforward way to do it.
Could anyone suggest where to look for more information, or has anyone
done this before and could share a few tips?
Most likely what you need is not the Lucene index but the segments
(shards), right? There's a utility called SegmentReader (available from
the command line as readseg), and you can use its API to retrieve either
all records or individual records from a segment (using the URL as the key).
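
As a rough illustration of the command-line route, here is a minimal Python sketch that only builds (and can optionally run) readseg invocations. It assumes the Nutch 0.8+/1.x option names (-get, -dump, and the -no* filters); the launcher path, segment directory, output directory, and URL are hypothetical placeholders, so check the usage message printed by "bin/nutch readseg" on your version before relying on it.

```python
import subprocess

# Assumption: launcher script location relative to the Nutch install directory.
NUTCH_BIN = "bin/nutch"


def readseg_get_cmd(segment_dir, url, nutch_bin=NUTCH_BIN):
    """Build the command that prints the records stored for one URL in a segment."""
    return [nutch_bin, "readseg", "-get", segment_dir, url]


def readseg_dump_cmd(segment_dir, output_dir, content_only=True, nutch_bin=NUTCH_BIN):
    """Build the command that dumps a whole segment as text into output_dir."""
    cmd = [nutch_bin, "readseg", "-dump", segment_dir, output_dir]
    if content_only:
        # Assumed filter flags: skip everything except the raw fetched content.
        cmd += ["-nofetch", "-nogenerate", "-noparse", "-noparsedata", "-noparsetext"]
    return cmd


def run(cmd):
    """Run a readseg command and return its standard output as text."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


if __name__ == "__main__":
    segment = "crawl/segments/20060101000000"  # hypothetical segment directory
    print(" ".join(readseg_get_cmd(segment, "http://example.com/")))
    print(" ".join(readseg_dump_cmd(segment, "segdump")))
```

Calling run() on either command list would execute it against a real installation; the main block above only prints the command lines so they can be inspected first.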
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com