Re: Sequential match query

2009-04-11 Thread Tim Williams
On Sat, Apr 11, 2009 at 12:25 PM, Erick Erickson wrote: > That'll teach me to scan a post. The link I sent you > is still relevant, but wildcards are NOT intended to be used to > concatenate terms. You want a phrase query or a span query > for that. i.e. "A C F"~# where # is the "slop", that is, t

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Yonik Seeley
OK, I think this will improve the situation: https://issues.apache.org/jira/browse/LUCENE-1596 -Yonik http://www.lucidimagination.com On Fri, Apr 10, 2009 at 1:47 PM, Michael McCandless wrote: > We never fully explained it, but we have some ideas... > > It's only if you iterate each term, and d

Re: SpellChecker in use with composite query

2009-04-11 Thread Amin Mohammed-Coleman
Hi Another thing that I was wondering is how to apply the construction of the spell index. Where is the most appropriate place to create the spell index? For example: IndexReader spellReader = IndexReader.open(fsDirectory1); IndexReader spellReader2 = IndexReader.open(fsDirectory2); MultiRead

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Erick Erickson
Siiiggghhh. So that means I'll have to really look at TrieRange before I can appear competent.. Thanks Erick On Sat, Apr 11, 2009 at 11:23 AM, Uwe Schindler wrote: > This is why I invented TrieRange: Full precision dates but less terms > during > filtering/searching. With TrieRange on the longs

RE: RangeFilter performance problem using MultiReader

2009-04-11 Thread Uwe Schindler
This is why I invented TrieRange: full-precision dates but fewer terms during filtering/searching. With TrieRange on the longs returned by Date.getTime() you even get millisecond precision without any speed decrease (only a bigger index size). Or double values with full precision, everything is

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Erick Erickson
OK, I only scanned the e-mails in this thread so I may be way off base, but has anyone yet asked the basic question of whether the granularity of the dates is really necessary? Raf and Roberto: It appears you're indexing your dates down to second resolution, which is why your number of unique ter
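To make the granularity point concrete, here is a minimal sketch in plain Java (no Lucene calls) that counts how many distinct terms a set of timestamps yields at second versus day resolution. The class name, method names, and term formats are assumptions chosen for illustration; they are not taken from the posters' code.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: shows how coarser date terms shrink the set of unique terms.
class DateGranularity {
    // Term formats are assumptions; any sortable fixed-width format would do.
    private static final DateTimeFormatter SECOND =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    static String toSecondTerm(long epochMillis) {
        return SECOND.format(Instant.ofEpochMilli(epochMillis));
    }

    static String toDayTerm(long epochMillis) {
        return DAY.format(Instant.ofEpochMilli(epochMillis));
    }

    // Counts the distinct terms produced at the chosen resolution.
    static int countUniqueTerms(long[] timestamps, boolean dayResolution) {
        Set<String> terms = new HashSet<>();
        for (long ts : timestamps) {
            terms.add(dayResolution ? toDayTerm(ts) : toSecondTerm(ts));
        }
        return terms.size();
    }

    public static void main(String[] args) {
        long start = Instant.parse("2009-04-11T00:00:00Z").toEpochMilli();
        long[] oneHour = new long[3600];
        for (int i = 0; i < oneHour.length; i++) {
            oneHour[i] = start + i * 1000L; // one document per second
        }
        System.out.println("second-resolution terms: " + countUniqueTerms(oneHour, false));
        System.out.println("day-resolution terms:    " + countUniqueTerms(oneHour, true));
    }
}
```

A range filter has to visit every term inside the range, so fewer distinct terms generally means less work per filter, at the cost of coarser range boundaries.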

Re: Sequential match query

2009-04-11 Thread Erick Erickson
That'll teach me to scan a post. The link I sent you is still relevant, but wildcards are NOT intended to be used to concatenate terms. You want a phrase query or a span query for that. i.e. "A C F"~# where # is the "slop", that is, the number of other terms allowed to appear between your desired t

Re: Sequential match query

2009-04-11 Thread Erick Erickson
Wildcard queries are not lowercased, so depending upon how you're indexing, that may be tripping you up. See http://wiki.apache.org/lucene-java/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a Best Erick On Fri, Apr 10, 2009 at 2:56 PM, John Seer wrote: > > Hello, > I have 3 terms and I
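As a rough illustration of the lowercasing point, the sketch below normalizes a user-supplied wildcard term before it is handed to whatever builds the query. It assumes the index was built with an analyzer that lowercases terms; the class and method names are hypothetical and no Lucene API is called.

```java
import java.util.Locale;

// Hypothetical helper: lowercases a wildcard term so it can match lowercased index terms.
class WildcardTerms {
    // Locale.ROOT avoids locale-specific surprises (for example the Turkish dotless i).
    static String normalize(String userTerm) {
        return userTerm.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The wildcard characters * and ? are unaffected by lowercasing.
        System.out.println(normalize("Luc*Ne?"));
    }
}
```

If the field is indexed without lowercasing, this step would hurt rather than help, so it only applies under the stated assumption.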

Re: Lucene searching across documents

2009-04-11 Thread Erick Erickson
http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of yo

RE: RangeFilter performance problem using MultiReader

2009-04-11 Thread Uwe Schindler
In addition to merging each month into one index instead of all in one index, you could also do some additional optimization when using the Range filter: Just combine only those indexes needed to fulfil the range spec during search. So if somebody wants to filter Jan 15 to Feb 15, only create a Mult
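The index-selection idea above can be sketched in plain Java: given a date range, work out which per-month indexes overlap it, so that only those would be opened and combined. The class and method names are hypothetical, and mapping each month to an actual reader is deliberately left out because the thread does not show how the directories are named.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the monthly indexes a date range touches.
class MonthlyIndexSelector {
    // Returns every month from the month of 'from' to the month of 'to', inclusive.
    static List<YearMonth> monthsCovering(LocalDate from, LocalDate to) {
        if (to.isBefore(from)) {
            throw new IllegalArgumentException("range end is before range start");
        }
        List<YearMonth> months = new ArrayList<>();
        YearMonth last = YearMonth.from(to);
        for (YearMonth m = YearMonth.from(from); !m.isAfter(last); m = m.plusMonths(1)) {
            months.add(m);
        }
        return months;
    }

    public static void main(String[] args) {
        // Jan 15 to Feb 15 touches two of the monthly indexes.
        System.out.println(monthsCovering(LocalDate.of(2009, 1, 15), LocalDate.of(2009, 2, 15)));
    }
}
```

Each returned month would then be looked up in an application-specific map from month to sub-reader, and only those sub-readers combined for the search.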

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Michael McCandless
Ahhh, OK, perhaps that explains the sizable perf difference you're seeing w/ optimized vs not. I'm curious to see the results of your "merge each month into 1 index" test... Mike On Sat, Apr 11, 2009 at 9:21 AM, Roberto Franchini wrote: > On Sat, Apr 11, 2009 at 1:50 PM, Michael McCandless > w

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Roberto Franchini
On Sat, Apr 11, 2009 at 1:50 PM, Michael McCandless wrote: > Hmm then I'm a bit baffled again. > > Because, each of your "by month" indexes presumably has a unique > subset of terms for the "date_doc" field?  Meaning, a given "by month" > index will have all date_doc corresponding to that month, a

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Michael McCandless
Hmm then I'm a bit baffled again. Because, each of your "by month" indexes presumably has a unique subset of terms for the "date_doc" field? Meaning, a given "by month" index will have all date_doc corresponding to that month, and a different "by month" index would presumably have no overlap in t

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Roberto Franchini
On Sat, Apr 11, 2009 at 11:48 AM, Michael McCandless wrote: > On Sat, Apr 11, 2009 at 5:27 AM, Raf wrote: > [cut] > > You have readers from 72 different directories, but is each directory > an optimized or unoptimized index? Hi, I'm Raffaella's colleague, and I'm the "indexer" while she is the "s

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Michael McCandless
On Sat, Apr 11, 2009 at 5:27 AM, Raf wrote: > I have repeated my tests using a searcher and now the performance on 2.9 are > very better than those on 2.4.1, especially when the filter extracts a lot > of docs. OK, phew! > However the same search on the consolidated index is even faster This i

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Raf
Hi Uwe, thanks for the clarification. I have repeated my tests using a searcher and now the performance on 2.9 is much better than on 2.4.1, especially when the filter extracts a lot of docs. However, the same search on the consolidated index is even faster, so I now have to verify if it is b

RE: RangeFilter performance problem using MultiReader

2009-04-11 Thread Uwe Schindler
Hi Raf, it would be nice to hear how it works for you! Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Raf [mailto:r.ventag...@gmail.com] > Sent: Saturday, April 11, 2009 9:22 AM > To: java-user@lu

RE: RangeFilter performance problem using MultiReader

2009-04-11 Thread Uwe Schindler
Ah, your test code shows why you do not see a speed improvement with 2.9: the speed improvement in 2.9 is only visible when executing real searches, not when calling getDocIdSet alone on the big MultiReader. The 2.9 search algorithm internally executes getDocIdSet not on the complete index (like you), it executes it

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Raf
Thanks Uwe, I had already read about TrieRangeFilter on this mailing list and I thought it could be useful to solve my problem. I think I will trie it for test purposes. Unfortunately, I have now to solve the problem in a production system and I would like to avoid using a yet unreleased version.

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Raf
No, it is a MultiReader that contains 72 (I am sorry, I wrote a wrong number last time) "single" readers. Raf On Fri, Apr 10, 2009 at 9:14 PM, Mark Miller wrote: > Raf wrote: > >> >> We have more or less 3M documents in 24 indexes and we read all of them >> using a MultiReader. >> >> > > Is this

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Raf
Ok, here you can find some details about my tests: *MultiReader creation* IndexReader subReader; List subReaders = new ArrayList(); for (Directory dir : this.directories) { try { subReader = IndexReader.open(dir, true); subReaders.add(subReader); } catch (...) {