There is another aspect of this that influences future performance:
When a Matcher is used to implement filtering, it can also be pushed down
into a boolean query as a required "clause". It would then end up
being called in the tight loop of ConjunctionScorer as one of the
required "clauses", instead
"Keep in mind that BitSetIterator is fast for iteration over all its bits.
If it's used as a filter (with skipping), I would expect it to be slower."
still, DenseBitsMatcher (BitSetIterator wrapped in Matcher) works faster than
anything else for this case:
int skip(Matcher m) throws IOExcepti
> "Less than M number of segments whose doc count n satisfies B*(M^c) <=
> n < B*(M^(c+1)) for any c >= 0."
> In other words, less than M number of segments with the same f(n).
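For concreteness, here is a minimal sketch of checking the quoted invariant, assuming each segment is represented only by its doc count. The names `MergeInvariant`, `level` and `holds` are hypothetical and not Lucene API. Segments with fewer than B docs fall outside the quoted ranges (c >= 0), so this sketch simply skips them; that choice is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class MergeInvariant {
    // Returns c such that B*M^c <= n < B*M^(c+1), or -1 when n < B.
    static int level(long n, long b, long m) {
        if (b < 1 || m < 2) {
            throw new IllegalArgumentException("need B >= 1 and M >= 2");
        }
        if (n < b) {
            return -1;
        }
        int c = 0;
        long upper = b * m; // exclusive upper bound of level c
        while (n >= upper) {
            upper *= m;
            c++;
        }
        return c;
    }

    // True when fewer than M segments share each level c >= 0.
    static boolean holds(long[] docCounts, long b, long m) {
        Map<Integer, Integer> perLevel = new HashMap<>();
        for (long n : docCounts) {
            int c = level(n, b, m);
            if (c < 0) {
                continue; // assumption: the quoted statement covers only c >= 0
            }
            int count = perLevel.merge(c, 1, Integer::sum);
            if (count >= m) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] docCounts = {10000, 1000, 1000, 1000};
        System.out.println(holds(docCounts, 1000, 10));
    }
}
```

With B=1000 and M=10, nine segments of 1000 docs satisfy the check and a tenth one violates it.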
Ah, I had missed that. But I don't believe that lucene currently
obeys this in all cases.
I think it does hold for n
On Sep 5, 2006, at 3:28 PM, Ning Li wrote:
Given M, B and an index which has L (0 < L < M) segments with docs
less than B, how many ram docs should be accumulated before a merge is
triggered? B is not good. B-sum(L) is the old strategy which has
problems. So between B-sum(L) and B? Once there a
Just brainstorming a little...
Assuming B=1000, M=10 (I think better with concrete examples)
It seems like we should avoid unnecessary merging, allowing up to 9
segments of 1000 documents or less w/o merging. When we reach 10
segments, they should be merged into a single segment. Let's assume a
So, I *think* most of our hypothetical problems go away with a simple
adjustment to f(n):
f(n) = floor(log_M((n-1)/B))
Correct. And nice. :-)
Equivalently,
f(n) = ceil(log_M (n / B)). If f(n) = c, it means B*(M^(c-1)) < n <= B*(M^(c)).
So f(n) = 0 means n <= B.
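A small sketch of the ceil form in integer arithmetic, to avoid floating-point log rounding at the boundaries. The class and method names are hypothetical, not Lucene API. Clamping at 0 for n <= B is my reading of "f(n) = 0 means n <= B"; a literal ceil(log_M(n / B)) would go negative for very small n.

```java
// Hypothetical helper, not part of Lucene.
final class MergeLevel {
    // Smallest c >= 0 such that n <= B * M^c,
    // i.e. B*M^(c-1) < n <= B*M^c for c >= 1, and n <= B for c == 0.
    static int level(long n, long b, long m) {
        if (n < 1 || b < 1 || m < 2) {
            throw new IllegalArgumentException("need n >= 1, B >= 1 and M >= 2");
        }
        int c = 0;
        long upper = b; // inclusive upper bound of level c
        while (n > upper) {
            upper *= m;
            c++;
        }
        return c;
    }

    public static void main(String[] args) {
        long[] samples = {1, 1000, 1001, 10000, 10001};
        for (long n : samples) {
            System.out.println(n + " -> " + level(n, 1000, 10));
        }
    }
}
```

With B=1000 and M=10, a segment of exactly 1000 docs lands in level 0 and one of 1001 docs in level 1. The floor form draws the same bucket boundaries; only the numbering of the buckets above B differs by one.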
--
On 9/6/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
So cut the Gordian Knot?
http://wiki.apache.org/jakarta-lucene/KinoSearchMergeModel
:-) Interesting stuff...
So it looks like you have intermediate things that aren't lucene
segments, but end up producing valid lucene segments at the end o
On Sep 6, 2006, at 10:30 AM, Yonik Seeley wrote:
So it looks like you have intermediate things that aren't lucene
segments, but end up producing valid lucene segments at the end of a
session?
That's one way of thinking about it. There's only one "thing"
though: a big bucket of serialized i
On 9/6/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
On Sep 6, 2006, at 10:30 AM, Yonik Seeley wrote:
> So it looks like you have intermediate things that aren't lucene
> segments, but end up producing valid lucene segments at the end of a
> session?
That's one way of thinking about it. There
Sounds interesting, Marvin. I would be willing to test out what you create. I
am working on creating a rapidly updating index and it sounds like this
may help that. I've noticed even using a ramdisk that the whole merging
process is quite slow. Maybe also because of the locking that occ
On Sep 6, 2006, at 12:06 PM, Yonik Seeley wrote:
Hmmm, not rewriting stored fields is nice.
I guess that could apply to anything that's strictly document
specific, such as term vectors.
Yes. Remember the old benchmarks I posted a few months ago?
KinoSearch's performance was much closer to
On 9/6/06, Ning Li <[EMAIL PROTECTED]> wrote:
> So, I *think* most of our hypothetical problems go away with a simple
> adjustment to f(n):
>
> f(n) = floor(log_M((n-1)/B))
Correct. And nice. :-)
Equivalently,
f(n) = ceil(log_M (n / B)). If f(n) = c, it means B*(M^(c-1)) < n <= B*(M^(c)).
So f
On 9/6/06, eks dev <[EMAIL PROTECTED]> wrote:
still, DenseBitsMatcher (BitSetIterator wrapped in Matcher) works faster than
anything else for this case:
int skip(Matcher m) throws IOException {
  int doc = -1, ret = 0;
  while (m.skipTo(doc + 1)) {
    doc = m.doc();
    ret += doc;
So what's left... maxMergeDocs I guess.
Capping the segment size breaks the simple invariants a bit.
Correct.
We also need to be able to handle changes to M and maxMergeDocs
between different IndexWriter sessions. When checking for a merge for
Hmmm. A change of M could easily break the inva
On 9/6/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
That's one way of thinking about it. There's only one "thing"
though: a big bucket of serialized index entries. At the end of a
session, those are sorted, pulled apart, and used to write the tis,
tii, frq, and prx files.
Interesting.
Whe
On Sep 6, 2006, at 4:23 PM, Ning Li wrote:
When do you add "merge-worthy" segments? I'd guess at the end of a
session, when it's easy to decide which segments are "merge-worthy".
Right. KS sorts the segments by size, then tries to merge the
smallest away. The calculation uses the fibonacc