Hi,

On Tue, Feb 19, 2013 at 11:04 AM, A. L. Benhenni <albenhe...@gmail.com> wrote:
> I am currently writing an indexer class to index texts from stdin. I also
> need the text to be tokenized and stored to access the termvector of the
> document.
Actually, you don't need to store documents to access their term vectors; these are two different options. Stored fields let you retrieve data exactly as you provided it to the IndexWriter, while term vectors give you a single-document inverted index of your document (mapping every unique term to its frequency, the positions where it appeared in the original document, etc.).

> 1/ Is there a more appropriate way of handling the indexing of an incoming
> stream ?

Your example looks odd to me: if I'm not mistaken, each iteration of the loop overrides the previous line with the current one (because path_field_name doesn't change). If you want your document to be stored (Store.YES), you need to buffer everything into a String before feeding it to Lucene.

> 2/ Is there an easy way to clean the index ?

IndexWriter.deleteAll?
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/index/IndexWriter.html#deleteAll()

> And a subsidiary 3/ Why a field can't store a reader ?

Lucene stores string fields by first writing their length, followed by their bytes, so it couldn't even start serializing a Reader before having consumed and buffered it fully (to know its length). Lucene doesn't allow the creation of stored fields from a Reader because it would give the impression of being lightweight (no need to load everything into memory at once) although it wouldn't be. On the other hand, you can provide a Reader to a field that is indexed and has term vectors turned on, and Lucene will consume it in a truly streaming fashion.

-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
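P.S. The buffering advice above (consume the whole stream into a String before creating a Store.YES field) can be sketched with plain java.io, no Lucene required. The class and method names here are illustrative, not part of any Lucene API:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderBuffering {

    // Consume a Reader fully into a String. A stored field (Store.YES)
    // needs the full value up front, because Lucene writes the length of
    // the value before its bytes.
    static String slurp(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        int n;
        while ((n = != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for stdin; in a real indexer this would be
        // new InputStreamReader(
        String text = slurp(new StringReader("line one\nline two\n"));
        System.out.println(text.length()); // prints 18
    }
}
```

The buffered String can then be passed to a stored field, while a second, un-buffered Reader over the same input can feed an indexed field with term vectors enabled, which Lucene consumes in a streaming fashion.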