Re: ControlledRealTimeReopenThread

2014-12-02 Thread Michael McCandless
TextField is dangerous: it is analyzed, possibly into more than one token, and then your deletes won't work. It's safer to use StringField for tokens you later want to delete by. Try making a standalone test that just deletes documents first... You don't need to iw.commit to make commits
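To make the point above concrete, here is a toy model in plain Java. It is not Lucene code: the class `DeleteByTermSketch` and its methods are hypothetical names, and the "analyzer" is just lowercasing plus splitting on non-alphanumeric characters. It only illustrates why deleting by the original value finds nothing once that value has been split into several tokens.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model only, NOT Lucene: mimics an analyzed field (several tokens per value)
// versus an un-analyzed field (the whole value indexed as one term).
class DeleteByTermSketch {
    // term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Set<Integer> liveDocs = new HashSet<>();

    // TextField-like behaviour (assumed analyzer: lowercase, split on non-alphanumerics).
    void addAnalyzed(int docId, String value) {
        liveDocs.add(docId);
        for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // StringField-like behaviour: the value is indexed verbatim as a single term.
    void addVerbatim(int docId, String value) {
        liveDocs.add(docId);
        postings.computeIfAbsent(value, k -> new HashSet<>()).add(docId);
    }

    // Deletes every live document indexed under exactly this term; returns how many.
    int deleteByTerm(String term) {
        int deleted = 0;
        for (int id : postings.getOrDefault(term, Collections.emptySet())) {
            if (liveDocs.remove(id)) {
                deleted++;
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        DeleteByTermSketch analyzed = new DeleteByTermSketch();
        analyzed.addAnalyzed(1, "ABC-123");
        // Indexed terms are "abc" and "123", so the original value matches nothing.
        System.out.println(analyzed.deleteByTerm("ABC-123"));

        DeleteByTermSketch verbatim = new DeleteByTermSketch();
        verbatim.addVerbatim(1, "ABC-123");
        // The value was indexed as one term, so the delete finds the document.
        System.out.println(verbatim.deleteByTerm("ABC-123"));
    }
}
```

In the analyzed case the delete only succeeds if you happen to pass one of the produced tokens, which is rarely what an ID-based delete intends.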

How best to compare two sentences

2014-12-02 Thread Paul Taylor
I'm trying to compare two song titles (usually in Latin script) for similarity. I'm looking for cases where the two titles seem to be the same song, accounting for spelling mistakes, additional words, et cetera. For a number of years I've been doing this by creating a RAMDirectory, creating a

Building non-core jar-files from lucene sources.

2014-12-02 Thread Badano Andrea
Hello, when I build Lucene from source using these instructions: https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/BUILD.txt I end up with only ./build/core/lucene-core-4.10.2-SNAPSHOT.jar. I would like to build lucene-analyzers-common-4.10.2.jar and lucene-queryparser-4.10.2.jar

Re: Building non-core jar-files from lucene sources.

2014-12-02 Thread Robert Muir
If you run ant -p it will print targets and descriptions. You want 'ant compile'. In my opinion the default target should not be 'jar', but should print this list of targets instead, just like the top-level build file. On Tue, Dec 2, 2014 at 12:09 PM, Badano Andrea andrea.bad...@sweco.se wrote:

Total Freq for Bigrams, Trigrams, etc.

2014-12-02 Thread Peter Organisciak
Is it possible to get a total corpus frequency for bigram queries or higher, i.e. how many times the query occurs in the corpus? I'm looking to implement a count of occurrences per million terms. I know for a single term I can use `TermsEnum.totalTermFreq()`, is there any comparable way to

Re: Total Freq for Bigrams, Trigrams, etc.

2014-12-02 Thread brettgleeson83
Is all the millions and random worms uncovered v runn command 1000.888 --Original Message-- From: Peter Organisciak To: java-user@lucene.apache.org ReplyTo: java-user@lucene.apache.org Subject: Total Freq for Bigrams, Trigrams, etc. Sent: Dec 2, 2014 8:38 PM It is

Re: Total Freq for Bigrams, Trigrams, etc.

2014-12-02 Thread Michael Sokolov
If you index the n-grams in their own field using ShingleFilter, you can get statistics using the same term API on that field, in which the terms *are* n-grams, and similarly for queries. -Mike On 12/02/2014 03:38 PM, Peter Organisciak wrote: It is possible to get a total corpus frequency
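The idea of treating each n-gram as a term of its own can be sketched without Lucene. The plain-Java example below is an illustration only: `BigramFrequencySketch` and its methods are hypothetical names, tokenization is a simple whitespace split, and the n-grams are built by hand rather than by ShingleFilter. It counts how often each n-gram occurs across a small corpus and converts a count into occurrences per million tokens.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration, NOT Lucene: n-grams are built by hand here,
// standing in for the terms a shingle filter would put into a dedicated field.
class BigramFrequencySketch {
    // Assumed tokenization: lowercase, split on whitespace.
    static String[] tokenize(String doc) {
        String trimmed = doc.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.toLowerCase(Locale.ROOT).split("\\s+");
    }

    // Total occurrences of every n-gram of adjacent tokens across all documents.
    static Map<String, Long> countNgrams(List<String> docs, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (String doc : docs) {
            String[] tokens = tokenize(doc);
            for (int i = 0; i + n <= tokens.length; i++) {
                String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                counts.merge(gram, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Total number of single tokens in the corpus, used as the denominator.
    static long totalTokens(List<String> docs) {
        long total = 0;
        for (String doc : docs) {
            total += tokenize(doc).length;
        }
        return total;
    }

    // Occurrences per million tokens; 0.0 for an empty corpus.
    static double perMillion(long occurrences, long totalTokens) {
        return totalTokens == 0 ? 0.0 : occurrences * 1_000_000.0 / totalTokens;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("new york is big", "I love New York");
        long count = countNgrams(docs, 2).getOrDefault("new york", 0L);
        System.out.println(count + " occurrences, "
                + perMillion(count, totalTokens(docs)) + " per million tokens");
    }
}
```

Whether to normalize by single tokens or by the number of n-grams is a design choice; this sketch uses single tokens because the question asks for occurrences per million terms.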

Re: Total Freq for Bigrams, Trigrams, etc.

2014-12-02 Thread brettgleeson83
1 madz is whorific funny asl xx Sent from my BlackBerry® wireless device -Original Message- From: Michael Sokolov msoko...@safaribooksonline.com Date: Tue, 02 Dec 2014 17:31:18 To: java-user@lucene.apache.org Reply-To: java-user@lucene.apache.org Subject: Re: Total Freq for Bigrams,

how to load mmap directory into memory?

2014-12-02 Thread Li Li
I am using the mmap FS directory in Lucene. My index is small (about 3GB on disk) and I have plenty of memory available. The problem is that the first time a term is queried, it's slow. How can I load the whole directory into memory? One solution is to use many queries to warm it up. But I can't query all terms
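One possible way to warm the files without issuing queries is to map each index file and ask the operating system to page it in. The sketch below uses only the JDK (`FileChannel.map` and `MappedByteBuffer.load`), not any Lucene API; the class name `WarmIndexFilesSketch`, the 1 GiB chunk size, and the assumption that the index is a flat directory of regular files are illustrative. `load()` is best-effort, so the pages may not all stay resident, and whether this removes the first-query latency in a given setup is an assumption to verify.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// JDK-only sketch, NOT a Lucene API: touches every file in a directory so the
// OS page cache is populated before the first real query arrives.
class WarmIndexFilesSketch {
    // A single mapping cannot exceed Integer.MAX_VALUE bytes, so map in chunks.
    private static final long CHUNK = 1L << 30;

    // Maps each regular file read-only and calls load(); returns the bytes covered.
    static long warm(Path indexDir) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                    long size = ch.size();
                    for (long pos = 0; pos < size; pos += CHUNK) {
                        long len = Math.min(CHUNK, size - pos);
                        MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                        buf.load(); // best-effort request to bring the pages into memory
                    }
                    total += size;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: pass the index directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("Warmed " + warm(dir) + " bytes under " + dir);
    }
}
```

This would typically run once at startup, before the searcher is opened for traffic.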

Re: A question on implementing new operators

2014-12-02 Thread david.w.smi...@gmail.com
Hi Prasad, Firstly, the Lucene ‘general’ list is not the appropriate list; it’s the java-user lucene list so I’m replying there instead. This is mostly about query parsing. If you look at Lucene’s modules, you’ll see a “queryparser” module. In there, there’s a “flexible” package which is named