Re: Regarding Compression Tool

2013-09-16 Thread Jebarlin Robertson
I am using Apache Lucene in Android. I have around 1 GB of text documents (logs). When I index these text documents using this *new Field(ContentIndex.KEY_TEXTCONTENT, contents, Field.Store.YES, Field.Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS)*, the index directory is consuming 1.59 GB
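An index larger than the source is expected with the field settings quoted above: the raw text is stored verbatim (Field.Store.YES) and term vectors with positions and offsets are kept on top of the inverted index. A minimal sketch of a leaner field, using the same Lucene 3.x classic API as the quoted code (the field name here is a stand-in for ContentIndex.KEY_TEXTCONTENT):

```java
import org.apache.lucene.document.Field;

public class LeanLogField {
    // Index the log text for search, but store neither the raw content
    // nor term vectors; re-read the original log file when the text is
    // needed. Drop the extras only if no feature of yours requires them
    // (e.g. highlighting typically wants stored text or offsets).
    static Field leanField(String contents) {
        return new Field("textcontent", contents,
                Field.Store.NO,
                Field.Index.ANALYZED,
                Field.TermVector.NO);
    }
}
```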

Multiple field instances and Field.Store.NO

2013-09-16 Thread Alan Burlison
I'm creating multiple instances of a field, some with Field.Store.YES and some with Field.Store.NO, with Lucene 4.4. If Field.Store.YES is set then I see multiple instances of the field in the documents in the resulting index, if I use Field.Store.NO then I only see a single field. Is that
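The setup being described can be reduced to a few lines. A hedged sketch against the Lucene 4.4 API: with Field.Store.YES every add() produces its own stored-field entry, so the document shows one stored value per call; with Field.Store.NO nothing is stored at all, though each value is still indexed and searchable.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public class MultiValuedField {
    // Add the same field twice with the given Store setting.
    static Document twoValues(String a, String b, Field.Store store) {
        Document doc = new Document();
        doc.add(new TextField("content", a, store));
        doc.add(new TextField("content", b, store));
        return doc;
    }
}
```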

Re: Multiple field instances and Field.Store.NO

2013-09-16 Thread Ian Lea
Not exactly dumb, and I can't tell you exactly what is happening here, but lucene stores some info at the index level rather than the field level, and things can get confusing if you don't use the same Field definition consistently for a field. From the javadocs for

Re: Multiple field instances and Field.Store.NO

2013-09-16 Thread Michael McCandless
That is strange. If you use Field.Store.NO for all fields for a given document then no field should have been stored. Can you boil this down to a small test case? Mike McCandless http://blog.mikemccandless.com On Mon, Sep 16, 2013 at 6:33 AM, Alan Burlison alan.burli...@gmail.com wrote: I'm

Re: Multiple field instances and Field.Store.NO

2013-09-16 Thread Alan Burlison
On 16 September 2013 11:47, Ian Lea ian@gmail.com wrote: Not exactly dumb, and I can't tell you exactly what is happening here, but lucene stores some info at the index level rather than the field level, and things can get confusing if you don't use the same Field definition consistently

Re: Multiple field instances and Field.Store.NO

2013-09-16 Thread Alan Burlison
On 16 September 2013 12:40, Michael McCandless luc...@mikemccandless.com wrote: If you use Field.Store.NO for all fields for a given document then no field should have been stored. Can you boil this down to a small test case? repeated calls to doc.add(new TextField(content, c,

Re: Multiple field instances and Field.Store.NO

2013-09-16 Thread Michael McCandless
On Mon, Sep 16, 2013 at 9:52 AM, Alan Burlison alan.burli...@gmail.com wrote: On 16 September 2013 12:40, Michael McCandless luc...@mikemccandless.com wrote: If you use Field.Store.NO for all fields for a given document then no field should have been stored. Can you boil this down to a small

Re: org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

2013-09-16 Thread Robert Muir
Mostly because our tokenizers like StandardTokenizer will tokenize the same way regardless of normalization form or whether it's normalized at all? But for other tokenizers, such a charfilter should be useful: there is a JIRA for it, but it has some unresolved issues

Lucene Query Syntax with analyzed and unanalyzed text

2013-09-16 Thread Scott Smith
I want to be sure I understand this correctly. Suppose I have a search that I'm going to run through the query parser that looks like: body:"some phrase" AND keyword:my-keyword. Clearly body and keyword are field names. However, the additional information is that the body field is analyzed and
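The usual way to handle a mix of analyzed and unanalyzed fields in one parsed query is a per-field analyzer, so terms like my-keyword are not split on the hyphen. A sketch under the assumption of Lucene 4.4 and the classic QueryParser (class and field names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class MixedFieldQuery {
    // "body" goes through StandardAnalyzer; "keyword" is kept as a
    // single token by KeywordAnalyzer, matching how it was indexed.
    static Query parse(String q) throws Exception {
        Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
        perField.put("keyword", new KeywordAnalyzer());
        Analyzer analyzer = new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(Version.LUCENE_44), perField);
        return new QueryParser(Version.LUCENE_44, "body", analyzer).parse(q);
    }
}
```

The key point is that the analyzer passed to the parser must mirror what was used at index time, field by field; otherwise an unanalyzed term such as my-keyword gets tokenized into pieces that never match the indexed value.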

Re: Multiple field instances and Field.Store.NO

2013-09-16 Thread Alan Burlison
Is Luke showing you stored fields? If so, this makes no sense ... Field.Store.NO (single or multiple calls) should have resulted in no stored fields. It shows the field but shows the content as not present or not stored -- Alan Burlison --

Re: org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

2013-09-16 Thread Benson Margulies
Thanks, I might pitch in. On Mon, Sep 16, 2013 at 12:58 PM, Robert Muir rcm...@gmail.com wrote: Mostly because our tokenizers like StandardTokenizer will tokenize the same way regardless of normalization form or whether it's normalized at all? But for other tokenizers, such a charfilter

org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

2013-09-16 Thread Benson Margulies
Can anyone shed light as to why this is a token filter and not a char filter? I'm wishing for one of these _upstream_ of a tokenizer, so that the tokenizer's lookups in its dictionaries are seeing normalized contents.
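For context, a sketch of how the existing filter sits in an analysis chain (assuming the Lucene 4.x analysis-icu module): normalization happens *after* tokenization, so the tokenizer's own dictionary lookups see the raw, un-normalized text first, which is exactly the limitation the question is about.

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class NfkcAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String field, Reader in) {
        // Tokenize first, then normalize each token (NFKC + case
        // folding by default) -- the reverse of what a CharFilter
        // upstream of the tokenizer would give you.
        Tokenizer source = new StandardTokenizer(Version.LUCENE_44, in);
        TokenStream normalized = new ICUNormalizer2Filter(source);
        return new TokenStreamComponents(source, normalized);
    }
}
```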

Re: org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

2013-09-16 Thread Robert Muir
That would be great! On Mon, Sep 16, 2013 at 1:41 PM, Benson Margulies ben...@basistech.com wrote: Thanks, I might pitch in. On Mon, Sep 16, 2013 at 12:58 PM, Robert Muir rcm...@gmail.com wrote: Mostly because our tokenizers like StandardTokenizer will tokenize the same way regardless of

exception while writing to index

2013-09-16 Thread nischal reddy
Hi, I am getting an exception while indexing files. I tried debugging but couldn't figure out the problem. I have a custom analyzer which creates the token stream; I am indexing around 15k files. When I start the indexing, after some time I get this exception:

IndexUpdater (4.4.0) fails when -verbose is not set

2013-09-16 Thread Bruce Karsh
Here it fails because -verbose is not set: $ java -cp ./lucene-core-4.4-SNAPSHOT.jar org.apache.lucene.index.IndexUpgrader ./INDEX Exception in thread "main" java.lang.IllegalArgumentException: printStream must not be null at
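A possible workaround until the bug is fixed, based on the report above: passing -verbose gives the upgrader a non-null PrintStream, so the same invocation succeeds (jar name and index path as in the report).

```shell
# Same command as above, with -verbose added as a workaround.
java -cp ./lucene-core-4.4-SNAPSHOT.jar \
     org.apache.lucene.index.IndexUpgrader -verbose ./INDEX
```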

RE: IndexUpdater (4.4.0) fails when -verbose is not set

2013-09-16 Thread Uwe Schindler
Hi Bruce, Thanks for investigating! Can you open a bug report on https://issues.apache.org/jira/browse/LUCENE ? Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Bruce Karsh