Re: Search performance with one index vs. many indexes

2005-02-27 Thread Morus Walter
Jochen Franke writes:
 Topic: Search performance with large numbers of indexes vs. one large index
 
 
 My questions are:
 
 - Is the size of the wordlist the problem?
 - Would we be a lot faster, when we have a smaller number
 of files per index?

Sure.
Look:
Looking up a word in one index is O(ln(n)), where n is the number of words.
Looking up a word in k indexes with m words each is O(k ln(m)).
In the best case all word lists are distinct (purely theoretical),
that is, n = k*m or m = n/k.
For n = 15 million and k = 800:
ln(n) = 16.5
k*ln(n/k) = 7871
In a realistic case m is much bigger, since the word lists won't be distinct.
But it's the linear factor k that bites you.
In the worst case (all words in all indexes) you have
k*ln(n) = 13218.8
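
The figures above can be reproduced with a short Python sketch. The function names are made up for illustration, and the values are only the relative cost model from this message (natural logarithms, constant factors ignored), not measured search times.

```python
import math

def single_index_cost(n):
    # One index holding all n words: O(ln(n)).
    return math.log(n)

def best_case_cost(n, k):
    # k indexes with fully distinct word lists, so m = n / k words each.
    return k * math.log(n / k)

def worst_case_cost(n, k):
    # k indexes that each contain all n words.
    return k * math.log(n)

n = 15_000_000  # total number of words
k = 800         # number of indexes

print(f"ln(n)     = {single_index_cost(n):.1f}")
print(f"k*ln(n/k) = {best_case_cost(n, k):.0f}")
print(f"k*ln(n)   = {worst_case_cost(n, k):.1f}")
```

Even in the best case the split layout costs several hundred times more than the single index in this model, which is the effect of the linear factor k.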

HTH
Morus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: help with boolean expression

2005-02-27 Thread Morus Walter
Omar Didi writes:
 I have a problem understanding how Lucene would interpret this boolean
 expression: A AND B OR C.
 It returns neither the same count as (A AND B) OR C nor the same count as
 A AND (B OR C).
 If anyone knows how it is interpreted, I would be thankful.
 thanks

A AND B OR C creates a query that requires A and B. C influences the
score, but is neither sufficient nor required for a match.

IMO the query parser is broken for queries mixing AND and OR without explicit
parentheses.
My favorite sample is `a AND b OR c AND d', which equals `a AND b AND c AND d'
in the query parser.
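
The behavior described above can be modelled in a few lines of Python. This is only a simplified sketch, not the actual QueryParser code: it assumes that AND marks both the preceding and the current clause as required, that OR leaves the current clause optional, and that there is no operator precedence or grouping. The function name `classic_flags` is hypothetical.

```python
def classic_flags(query):
    """Model how an unparenthesized AND/OR query ends up as a flat clause list.

    Assumption: AND marks the previous and the current clause as required
    ('+'); OR leaves the current clause optional; no precedence is applied.
    """
    tokens = query.split()
    clauses = []  # each entry is [term, required]
    for i in range(0, len(tokens), 2):
        term = tokens[i]
        conjunction = tokens[i - 1] if i > 0 else None
        if conjunction == "AND":
            clauses[-1][1] = True          # previous clause becomes required
            clauses.append([term, True])   # current clause is required
        else:
            clauses.append([term, False])  # optional, only affects scoring
    return " ".join(("+" if required else "") + term
                    for term, required in clauses)

print(classic_flags("A AND B OR C"))        # +A +B C
print(classic_flags("a AND b OR c AND d"))  # +a +b +c +d
```

In this model C stays optional in the first query, and the OR in the second query has no effect at all because both of its neighbours are made required by the surrounding ANDs.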

I suggested a patch some time ago, but it's still pending in Bugzilla:
http://issues.apache.org/bugzilla/show_bug.cgi?id=25820

I don't know if it's still usable with the current sources.

Morus




Re: Sorting date stored in milliseconds time

2005-02-27 Thread Morus Walter
Ben writes:
 
 I store my date in milliseconds, how can I do a sort on it? SortField
 has INT, FLOAT and STRING. Do I need to create a new sort class, to
 sort the long value?
 
Why do you need that precision?
Remember: there's a price to pay. The memory required for sorting and
the time to set up the sort cache depend on the number of different terms,
which are dates in your case.
I can hardly think of an application where seconds are relevant, so what do
you need milliseconds for?
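
To illustrate the point about the number of different terms, here is a minimal Python sketch that counts distinct sort terms at different resolutions. The helper names and the sample timestamps are invented for illustration, and the zero-padded key is just one possible way to make a truncated value usable with a string sort; it is not a Lucene API.

```python
def distinct_terms(timestamps_ms, resolution_ms):
    # Number of different terms after truncating to the given resolution.
    return len({ts // resolution_ms for ts in timestamps_ms})

def sortable_key(ts_ms, resolution_ms, width=13):
    # Zero-padded so that lexicographic order matches numeric order.
    return str(ts_ms // resolution_ms).zfill(width)

DAY_MS = 86_400_000

# Hypothetical sample: 1000 documents, 1.5 seconds apart, starting at
# 2005-02-27 00:00:00 UTC.
stamps = [1_109_462_400_000 + i * 1500 for i in range(1000)]

print(distinct_terms(stamps, 1))       # millisecond resolution
print(distinct_terms(stamps, DAY_MS))  # day resolution
```

With this sample every document has its own term at millisecond resolution, while all of them share a single term at day resolution, so a coarser resolution means fewer terms to cache.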

Morus
