Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Doug Cutting wrote: The graph just shows that they differ, not how much better or worse they are, since the baseline is not perfect. When the top-10 is 50% different, are those 5 different hits markedly worse matches to your eye than the five th

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Doug Cutting
Andrzej Bialecki wrote: Doug Cutting wrote: The graph just shows that they differ, not how much better or worse they are, since the baseline is not perfect. When the top-10 is 50% different, are those 5 different hits markedly worse matches to your eye than the five they've displaced, or are

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: . How were the queries generated? From a log or randomly? Queries have been picked up manually, to test the worst performing cases from a real query log. So, for example, the 50% error rate might not be typical, but could be worst-case.

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Doug Cutting
Andrzej Bialecki wrote: . How were the queries generated? From a log or randomly? Queries have been picked up manually, to test the worst performing cases from a real query log. So, for example, the 50% error rate might not be typical, but could be worst-case. . When results differed

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: I tested it on a 5 mln index. Thanks, this is great data! Can you please tell a bit more about the experiments? In particular: . How were scores assigned to pages? Link analysis? log(number of incoming links) or OPIC? log() . How were

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Doug Cutting
Andrzej Bialecki wrote: I tested it on a 5 mln index. Thanks, this is great data! Can you please tell a bit more about the experiments? In particular: . How were scores assigned to pages? Link analysis? log(number of incoming links) or OPIC? . How were the queries generated? From a log

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Ok, I just tested IndexSorter for now. It appears to work correctly, at least I get exactly the same results, with the same scores and the same explanations, if I run the same queries on the original and on the sorted index. Here's a more comple

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Doug Cutting
Andrzej Bialecki wrote: I'll test it soon - one comment, though. Currently you use a subclass of RuntimeException to stop the collecting. I think we should come up with a better mechanism - throwing exceptions is too costly. I thought about this, but I could not see a simple way to achieve it.

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Ok, I just tested IndexSorter for now. It appears to work correctly, at least I get exactly the same results, with the same scores and the same explanations, if I run the same queries on the original and on the sorted index. Here's a more comple

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Doug Cutting
Andrzej Bialecki wrote: Ok, I just tested IndexSorter for now. It appears to work correctly, at least I get exactly the same results, with the same scores and the same explanations, if I run the same queries on the original and on the sorted index. Here's a more complete version, still mostly

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Shouldn't this be combined with a HitCollector that collects only the first-n matches? Otherwise we still need to scan the whole posting list... Yes. I was just posting the work-in-progress. Ok, I just tested IndexSorter for now. It appears t

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Doug Cutting
Andrzej Bialecki wrote: Shouldn't this be combined with a HitCollector that collects only the first-n matches? Otherwise we still need to scan the whole posting list... Yes. I was just posting the work-in-progress. We will also need to estimate the total number of matches by extrapolating li

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: By all means please start, this is still near the limits of my knowledge of Lucene... ;-) Attached is a class which sorts a Nutch index by boost. I have only tested it on a ~100 page index, where it appears to work correctly. Please tell me how

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Doug Cutting
Andrzej Bialecki wrote: By all means please start, this is still near the limits of my knowledge of Lucene... ;-) Attached is a class which sorts a Nutch index by boost. I have only tested it on a ~100 page index, where it appears to work correctly. Please tell me how it works for you. Dou

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Doug Cutting
Andrzej Bialecki wrote: By all means please start, this is still near the limits of my knowledge of Lucene... ;-) Okay, I'll try to get something working fairly soon. Doug

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Andrzej Bialecki
Doug Cutting wrote: Yes, this is why I was discouraged and stopped working on this. However I am now hopeful that sorting the entire index by page score and using top-1000 might work well with Nutch queries, since page score is field-independent, and I think fields cause the problems. Plus,

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Doug Cutting
Andrzej Bialecki wrote: For single term queries (in Nutch - in Lucene they are rewritten to complex BooleanQueries), the hit lists are nearly identical for the first 10 hits, then they start to differ more and more as you progress along the original hit list. This is not so surprising - after a

IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Andrzej Bialecki
Doug Cutting wrote: The IndexOptimizer.java class in the searcher package was an old attempt to create something like what Suel calls "fancy postings". It creates an index with the top 10% scoring postings. Since documents are not renumbered one can intermix postings from this with the full