Hi,

On Tue, Feb 19, 2013 at 11:04 AM, A. L. Benhenni <albenhe...@gmail.com> wrote:
> I am currently writing an indexer class to index texts from stdin. I also
> need the text to be tokenized and stored to access the termvector of the
> document.
Actually, you don't need to store documents to access their term vectors; these are two different options. Stored fields let you retrieve data exactly as you provided it to the IndexWriter, while term vectors give you a single-document inverted index of your document (mapping every unique term to its frequency, the positions where it appeared in the original document, etc.).

> 1/ Is there a more appropriate way of handling the indexing of an incoming
> stream ?

Your example looks odd to me: if I'm not mistaken, each iteration of the loop overrides the previous line with the current one (because path_field_name doesn't change). If you want your document to be stored (Store.YES), you need to buffer everything into a String before feeding it to Lucene.

> 2/ Is there an easy way to clean the index ?

IndexWriter.deleteAll?
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/index/IndexWriter.html#deleteAll()

> And a subsidiary 3/ Why a field can't store a reader ?

Lucene stores string fields by first writing their length, followed by their bytes, so it couldn't even start serializing a Reader before having consumed and buffered it fully (to know its length). Lucene doesn't allow the creation of stored fields from a Reader because it would give the impression of being lightweight (no need to load everything into memory at once) although it wouldn't be. On the other hand, you can provide a Reader to a field that is indexed and has term vectors turned on, and Lucene will consume it in a truly streaming fashion.

-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
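P.S. The buffering advice above (consume the whole stream into a String before creating a Store.YES field) can be sketched with plain java.io, no Lucene required. The class and method names here are illustrative, not part of any Lucene API:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderBuffering {

    // Consume a Reader fully into a String. A stored field (Store.YES)
    // needs the full value up front, because Lucene writes the length of
    // the value before its bytes.
    static String slurp(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        int n;
        while ((n = != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for stdin; in a real indexer this would be
        // new InputStreamReader(
        String text = slurp(new StringReader("line one\nline two\n"));
        System.out.println(text.length()); // prints 18
    }
}
```

The buffered String can then be passed to a stored field, while a second, un-buffered Reader over the same input can feed an indexed field with term vectors enabled, which Lucene consumes in a streaming fashion.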