Re: [jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2006-09-06 Thread Paul Elschot
There is another aspect of this that influences future performance: When a Matcher is used to implement filtering, it can also be pushed down into a boolean query as a required "clause". It would then end up being called in the tight loop of ConjunctionScorer as one of the required "clauses", instead

Re: [jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2006-09-06 Thread eks dev
"Keep in mind that BitSetIterator is fast for iteration over all its bits. If it's used as a filter (with skipping), I would expect it to be slower." still, DenseBitsMatcher (BitSetIterator wrapped in Matcher) works faster than anything else for this case: int skip(Matcher m) throws IOExcepti

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Ning Li
> "Less than M number of segments whose doc count n satisfies B*(M^c) <= > n < B*(M^(c+1)) for any c >= 0." > In other words, less than M number of segments with the same f(n). Ah, I had missed that. But I don't believe that lucene currently obeys this in all cases. I think it does hold for n

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Marvin Humphrey
On Sep 5, 2006, at 3:28 PM, Ning Li wrote: Given M, B and an index which has L (0 < L < M) segments with docs less than B, how many ram docs should be accumulated before a merge is triggered? B is not good. B-sum(L) is the old strategy which has problems. So between B-sum(L) and B? Once there a

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Yonik Seeley
Just brainstorming a little... Assuming B=1000, M=10 (I think better with concrete examples) It seems like we should avoid unnecessary merging, allowing up to 9 segments of 1000 documents or less w/o merging. When we reach 10 segments, they should be merged into a single segment. Let's assume a
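
The rule described above (up to 9 segments of at most B documents are left alone, and the 10th triggers a merge into a single segment) can be sketched roughly as follows. This is only an illustration for the lowest level, with B=1000 and M=10 taken from the message; the class name MergeSketch and the method addSegment are hypothetical and are not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeSketch {
    static final int B = 1000; // max docs in a "small" segment (from the example)
    static final int M = 10;   // merge factor (from the example)

    // Hypothetical helper: appends a segment, then merges all segments of
    // at most B docs into one once there are M of them. Larger segments
    // are left untouched; higher levels are not handled in this sketch.
    public static void addSegment(List<Integer> segments, int docCount) {
        segments.add(docCount);
        int small = 0;
        for (int n : segments) {
            if (n <= B) {
                small++;
            }
        }
        if (small < M) {
            return; // fewer than M small segments: no merge needed
        }
        int merged = 0;
        List<Integer> kept = new ArrayList<>();
        for (int n : segments) {
            if (n <= B) {
                merged += n;
            } else {
                kept.add(n);
            }
        }
        kept.add(merged);
        segments.clear();
        segments.addAll(kept);
    }

    public static void main(String[] args) {
        List<Integer> segments = new ArrayList<>();
        for (int i = 0; i < M; i++) {
            addSegment(segments, B);
            System.out.println(segments);
        }
    }
}
```

With these values, the first nine calls leave nine separate 1000-document segments, and the tenth collapses them into one segment of 10000 documents.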

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Ning Li
So, I *think* most of our hypothetical problems go away with a simple adjustment to f(n): f(n) = floor(log_M((n-1)/B)) Correct. And nice. :-) Equivalently, f(n) = ceil(log_M (n / B)). If f(n) = c, it means B*(M^(c-1)) < n <= B*(M^(c)). So f(n) = 0 means n <= B. --
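
A minimal sketch of the level function in its ceil form, computed with integer arithmetic so that the boundaries B*(M^(c-1)) < n <= B*(M^c) come out exactly. The class and method names are hypothetical, and clamping the result at 0 for n <= B is an assumption based on "f(n) = 0 means n <= B".

```java
public class SegmentLevel {
    // Returns the smallest c >= 0 such that n <= B * M^c,
    // i.e. ceil(log_M(n / B)) clamped at 0 (the clamp is an assumption).
    public static int level(long n, long b, long m) {
        if (n < 0 || b < 1 || m < 2) {
            throw new IllegalArgumentException("need n >= 0, B >= 1, M >= 2");
        }
        int c = 0;
        long upper = b; // upper bound of level c: B * M^c
        while (n > upper) {
            upper *= m;
            c++;
        }
        return c;
    }

    public static void main(String[] args) {
        long b = 1000, m = 10;
        for (long n : new long[] {1, 1000, 1001, 10000, 10001}) {
            System.out.println(n + " -> level " + level(n, b, m));
        }
    }
}
```

Using a multiplication loop instead of Math.log avoids floating-point rounding at the exact boundaries n = B*(M^c).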

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Yonik Seeley
On 9/6/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote: So cut the Gordian Knot? http://wiki.apache.org/jakarta-lucene/KinoSearchMergeModel :-) Interesting stuff... So it looks like you have intermediate things that aren't lucene segments, but end up producing valid lucene segments at the end o

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Marvin Humphrey
On Sep 6, 2006, at 10:30 AM, Yonik Seeley wrote: So it looks like you have intermediate things that aren't lucene segments, but end up producing valid lucene segments at the end of a session? That's one way of thinking about it. There's only one "thing" though: a big bucket of serialized i

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Yonik Seeley
On 9/6/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote: On Sep 6, 2006, at 10:30 AM, Yonik Seeley wrote: > So it looks like you have intermediate things that aren't lucene > segments, but end up producing valid lucene segments at the end of a > session? That's one way of thinking about it. There

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread jason rutherglen
Sounds interesting Marvin, I would be willing to test out what you create. I am trying to create a rapidly updating index and it sounds like this may help that. I've noticed even using a ramdisk that the whole merging process is quite slow. Maybe also because of the locking that occ

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Marvin Humphrey
On Sep 6, 2006, at 12:06 PM, Yonik Seeley wrote: Hmmm, not rewriting stored fields is nice. I guess that could apply to anything that's strictly document specific, such as term vectors. Yes. Remember the old benchmarks I posted a few months ago? KinoSearch's performance was much closer to

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Yonik Seeley
On 9/6/06, Ning Li <[EMAIL PROTECTED]> wrote: > So, I *think* most of our hypothetical problems go away with a simple > adjustment to f(n): > > f(n) = floor(log_M((n-1)/B)) Correct. And nice. :-) Equivalently, f(n) = ceil(log_M (n / B)). If f(n) = c, it means B*(M^(c-1)) < n <= B*(M^(c)). So f

Re: [jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2006-09-06 Thread Yonik Seeley
On 9/6/06, eks dev <[EMAIL PROTECTED]> wrote: still, DenseBitsMatcher (BitSetIterator wrapped in Matcher) works faster than anything else for this case: int skip(Matcher m) throws IOException{ int doc=-1, ret = 0; while(m.skipTo(doc+1)){ doc = m.doc(); ret+=doc;
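
For readers who want to try this kind of skipTo loop, here is a self-contained sketch in the same spirit. It is not the code from the message, which is cut off above. The Matcher interface below is a hypothetical two-method stand-in, not the one from the LUCENE-584 patch; it is backed by java.util.BitSet, and it omits the IOException declared in the original signature because everything is in memory.

```java
import java.util.BitSet;

public class SkipSketch {
    /** Hypothetical stand-in interface; an assumption, not the patch's Matcher. */
    public interface Matcher {
        boolean skipTo(int target); // advance to first doc >= target
        int doc();                  // current doc, valid after skipTo returned true
    }

    /** Matcher over a java.util.BitSet, using nextSetBit for skipping. */
    public static final class BitSetMatcher implements Matcher {
        private final BitSet bits;
        private int doc = -1;

        public BitSetMatcher(BitSet bits) {
            this.bits = bits;
        }

        @Override
        public boolean skipTo(int target) {
            doc = bits.nextSetBit(Math.max(target, 0)); // -1 when exhausted
            return doc >= 0;
        }

        @Override
        public int doc() {
            return doc;
        }
    }

    // Visits every matching doc via skipTo(doc + 1) and sums the doc ids,
    // so the loop result cannot be optimized away in a timing run.
    public static int sumDocs(Matcher m) {
        int doc = -1, ret = 0;
        while (m.skipTo(doc + 1)) {
            doc = m.doc();
            ret += doc;
        }
        return ret;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(3);
        bits.set(7);
        bits.set(20);
        System.out.println(sumDocs(new BitSetMatcher(bits)));
    }
}
```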

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Ning Li
So what's left... maxMergeDocs I guess. Capping the segment size breaks the simple invariants a bit. Correct. We also need to be able to handle changes to M and maxMergeDocs between different IndexWriter sessions. When checking for a merge for Hmmm. A change of M could easily break the inva

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Ning Li
On 9/6/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote: That's one way of thinking about it. There's only one "thing" though: a big bucket of serialized index entries. At the end of a session, those are sorted, pulled apart, and used to write the tis, tii, frq, and prx files. Interesting. Whe

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Marvin Humphrey
On Sep 6, 2006, at 4:23 PM, Ning Li wrote: When do you add "merge-worthy" segments? I'd guess at the end of a session, when it's easy to decide which segments are "merge-worthy". Right. KS sorts the segments by size, then tries to merge the smallest away. The calculation uses the fibonacc