Re: IndexSorter optimizer

2006-01-05 Thread Doug Cutting
Andrzej Bialecki wrote: Right, I confused two bugs from different files. The other bug still exists in the committed version of the LuceneQueryOptimizer.LimitedCollector constructor: instead of super(maxHits) it should be super(numHits). This was actually the bug that was causing that myst

Re: IndexSorter optimizer

2006-01-04 Thread Byron Miller
Great reading and great ideas. In such a system where you have, say, 3 segment partitions, is it possible to build a mapreduce job to efficiently fetch, retrieve and update these segments? Use a map job to process a segment for deletion and somehow process that segment to create a new fetchlist from

Re: IndexSorter optimizer

2006-01-04 Thread Andrzej Bialecki
Doug Cutting wrote: Byron Miller wrote: On optimizing performance, does anyone know if Google is exporting its entire dataset as an index or only somehow indexing the topN % (since they only show the first 1000 or so results anyway)? Both. The highest-scoring pages are kept in separate inde

Re: IndexSorter optimizer

2006-01-04 Thread Doug Cutting
Byron Miller wrote: On optimizing performance, does anyone know if Google is exporting its entire dataset as an index or only somehow indexing the topN % (since they only show the first 1000 or so results anyway)? Both. The highest-scoring pages are kept in separate indexes that are searched f

Re: IndexSorter optimizer

2006-01-03 Thread Byron Miller
On optimizing performance, does anyone know if Google is exporting its entire dataset as an index or only somehow indexing the topN % (since they only show the first 1000 or so results anyway)? With this patch and a top result set in the XML file, does that mean it will stop scanning the index at th

Re: IndexSorter optimizer

2006-01-02 Thread Andrzej Bialecki
Doug Cutting wrote: I have committed this, along with the LuceneQueryOptimizer changes. I could only find one place where I was using numDocs() instead of maxDoc(). Right, I confused two bugs from different files. The other bug still exists in the committed version of the LuceneQueryOpti

Re: IndexSorter optimizer

2006-01-02 Thread Doug Cutting
Andrzej Bialecki wrote: Sounds like tf/idf might be de-emphasized in scoring. Perhaps NutchSimilarity.tf() should use log() instead of sqrt() when field==content? I don't think it's that simple, the OPIC score is what determined this behaviour, and it doesn't correspond at all to tf/idf, but
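
To give a sense of scale for the tf() suggestion quoted above, here is a small standalone sketch (not Nutch or Lucene code) that compares square-root damping, which is what Lucene's DefaultSimilarity.tf() uses, with a logarithmic alternative. The class name TfDamping and the exact log form, 1 + ln(freq), are assumptions made for illustration; the thread does not say which log variant was meant.

```java
public class TfDamping {
    // Square-root damping, the form used by Lucene's DefaultSimilarity.tf().
    static float sqrtTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Hypothetical logarithmic damping: 1 + ln(freq) for freq > 0, else 0.
    // This exact formula is an assumption, not taken from the thread.
    static float logTf(float freq) {
        return freq > 0 ? (float) (1.0 + Math.log(freq)) : 0.0f;
    }

    public static void main(String[] args) {
        // Print both damping functions side by side for growing term frequencies.
        for (int freq : new int[] {1, 10, 100, 1000}) {
            System.out.printf("freq=%d sqrt=%.2f log=%.2f%n",
                    freq, sqrtTf(freq), logTf(freq));
        }
    }
}
```

Under this assumed log form the two functions agree at freq = 1, and the logarithm grows far more slowly after that, so a page that repeats a term very often would gain less from repetition alone.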

Re: IndexSorter optimizer

2006-01-02 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Using the original index, it was possible for pages with high tf/idf of a term, but with a low "boost" value (the OPIC score), to outrank pages with high "boost" but lower tf/idf of a term. This phenomenon leads quite often to results that are perc

Re: IndexSorter optimizer

2006-01-02 Thread Doug Cutting
Andrzej Bialecki wrote: Using the original index, it was possible for pages with high tf/idf of a term, but with a low "boost" value (the OPIC score), to outrank pages with high "boost" but lower tf/idf of a term. This phenomenon leads quite often to results that are perceived as "junk", e.g. p

Re: IndexSorter optimizer

2006-01-02 Thread Doug Cutting
Andrzej Bialecki wrote: I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually bett

Re: IndexSorter optimizer

2005-12-21 Thread Andrzej Bialecki
American Jeff Bowden wrote: Andrzej Bialecki wrote: Hi, I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at l

Re: IndexSorter optimizer

2005-12-21 Thread American Jeff Bowden
Andrzej Bialecki wrote: Hi, I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actual

Re: IndexSorter optimizer

2005-12-21 Thread Byron Miller
I've got a 400mill db I can run this against over the next few days. -byron --- Stefan Groschupf <[EMAIL PROTECTED]> wrote: > Hi Andrzej, > > Wow, this is really great news! > > Using the optimized index, I reported previously that some of the top-scoring results were missing. As it happens,

Re: IndexSorter optimizer

2005-12-21 Thread Stefan Groschupf
Hi Andrzej, Wow, this is really great news! Using the optimized index, I reported previously that some of the top-scoring results were missing. As it happens, the missing results were typically the "junk" pages with high tf/idf but low "boost". Since we collect up to N hits, going from higher to
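
The "collect up to N hits" idea quoted above can be illustrated with a toy sketch. This is not the IndexSorter or LuceneQueryOptimizer code: it assumes documents are numbered in descending "boost" order, so that low document ids carry the highest boost, and it stands in for the query with a plain boolean array. The class name EarlyTermination and the method collect are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyTermination {
    // Scans documents in index order (assumed: descending boost) and stops
    // as soon as maxHits matching documents have been collected.
    // matches[doc] says whether document `doc` matches the query.
    static List<Integer> collect(boolean[] matches, int maxHits) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < matches.length && hits.size() < maxHits; doc++) {
            if (matches[doc]) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        boolean[] matches = {true, false, true, true, false, true};
        // Document 5 also matches, but the scan stops before reaching it.
        System.out.println(collect(matches, 3));
    }
}
```

In this sketch a matching document late in the index is never examined once the limit is reached, whatever its tf/idf, which is consistent with the missing low-boost results described in the thread.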

IndexSorter optimizer

2005-12-21 Thread Andrzej Bialecki
Hi, I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually better. The reason wh