Andrzej Bialecki wrote:
Right, I confused two bugs from different files - the other bug still
exists in the committed version of the
LuceneQueryOptimizer.LimitedCollector constructor, instead of
super(maxHits) it should be super(numHits) - this was actually the bug,
which was causing that myst
Great reading and great ideas.
In such a system where you have say 3 segment
partitions is it possible to build a mapreduce job to
efficiently fetch, retrieve and update these segments?
Use a map job to process a segment for deletion and
somehow process that segment to create a new fetchlist
from
Doug Cutting wrote:
Byron Miller wrote:
On optimizing performance, does anyone know if google
is exporting its entire dataset as an index or only
somehow indexing the topN % (since they only show the
first 1000 or so results anyway)
Both. The highest-scoring pages are kept in separate inde
Byron Miller wrote:
On optimizing performance, does anyone know if google
is exporting its entire dataset as an index or only
somehow indexing the topN % (since they only show the
first 1000 or so results anyway)
Both. The highest-scoring pages are kept in separate indexes that are
searched f
On optimizing performance, does anyone know if google
is exporting its entire dataset as an index or only
somehow indexing the topN % (since they only show the
first 1000 or so results anyway)
With this patch and a top result set in the xml file
does that mean it will stop scanning the index at th
Doug Cutting wrote:
I have committed this, along with the LuceneQueryOptimizer changes.
I could only find one place where I was using numDocs() instead of
maxDoc().
Right, I confused two bugs from different files - the other bug still
exists in the committed version of the
LuceneQueryOpti
Andrzej Bialecki wrote:
Sounds like tf/idf might be de-emphasized in scoring. Perhaps
NutchSimilarity.tf() should use log() instead of sqrt() when
field==content?
I don't think it's that simple, the OPIC score is what determined this
behaviour, and it doesn't correspond at all to tf/idf, but
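
To make the sqrt() versus log() suggestion above concrete, here is a minimal sketch in Python (not Nutch's Java) comparing the two damping curves. The names tf_sqrt and tf_log are hypothetical, and the 1 + ln(freq) form is only one common way to write a log-damped tf; the thread does not say which form was intended.

```python
import math


def tf_sqrt(freq):
    # Square-root damping, the shape of Lucene's default tf().
    return math.sqrt(freq)


def tf_log(freq):
    # Hypothetical log damping: 1 + ln(freq) for freq > 0, otherwise 0.
    return 1.0 + math.log(freq) if freq > 0 else 0.0


# Print both curves side by side for a few term frequencies.
for freq in (1, 10, 100, 1000):
    print(freq, round(tf_sqrt(freq), 2), round(tf_log(freq), 2))
```

Under these assumptions the log curve grows far more slowly at high term frequencies, so a page that repeats a term many times gains less from tf alone.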
Doug Cutting wrote:
Andrzej Bialecki wrote:
Using the original index, it was possible for pages with high tf/idf
of a term, but with a low "boost" value (the OPIC score), to outrank
pages with high "boost" but lower tf/idf of a term. This phenomenon
leads quite often to results that are perc
Andrzej Bialecki wrote:
Using the original index, it was possible for pages with high tf/idf of
a term, but with a low "boost" value (the OPIC score), to outrank pages
with high "boost" but lower tf/idf of a term. This phenomenon leads
quite often to results that are perceived as "junk", e.g. p
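
A toy illustration of the effect described above, as a sketch only: it assumes the final score is simply the product of a tf/idf value and the "boost", and the page names and numbers are made up for the example.

```python
def score(tfidf, boost):
    # Assumption: final score is the tf/idf part multiplied by the boost.
    return tfidf * boost


# (tfidf, boost) per page; values are invented for illustration.
pages = {
    "junk": (8.0, 0.5),           # high tf/idf, low boost
    "authoritative": (1.5, 2.0),  # lower tf/idf, high boost
}

ranked = sorted(pages, key=lambda name: score(*pages[name]), reverse=True)
print(ranked)
```

With these numbers the "junk" page scores 4.0 against 3.0 for the high-boost page, so it ranks first despite its low boost.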
Andrzej Bialecki wrote:
I'm happy to report that further tests performed on a larger index seem
to show that the overall impact of the IndexSorter is definitely
positive: performance improvements are significant, and the overall
quality of results seems at least comparable, if not actually bett
American Jeff Bowden wrote:
Andrzej Bialecki wrote:
Hi,
I'm happy to report that further tests performed on a larger index
seem to show that the overall impact of the IndexSorter is definitely
positive: performance improvements are significant, and the overall
quality of results seems at l
Andrzej Bialecki wrote:
Hi,
I'm happy to report that further tests performed on a larger index
seem to show that the overall impact of the IndexSorter is definitely
positive: performance improvements are significant, and the overall
quality of results seems at least comparable, if not actual
I've got a 400 million db I can run this against over the
next few days.
-byron
--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi Andrzej,
>
> wow, this is really great news!
> > Using the optimized index, I reported previously
> that some of the
> > top-scoring results were missing. As it happens,
>
Hi Andrzej,
wow, this is really great news!
Using the optimized index, I reported previously that some of the
top-scoring results were missing. As it happens, the missing
results were typically the "junk" pages with high tf/idf but low
"boost". Since we collect up to N hits, going from higher to
Hi,
I'm happy to report that further tests performed on a larger index seem
to show that the overall impact of the IndexSorter is definitely
positive: performance improvements are significant, and the overall
quality of results seems at least comparable, if not actually better.
The reason wh
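
One way to picture the behaviour discussed in this thread is the sketch below. It is written in Python and is not the IndexSorter or LuceneQueryOptimizer code: it assumes the sorted index lists documents by descending boost, that scanning stops after max_hits matches, and that the score is tf/idf times boost. The function name and the sample data are hypothetical.

```python
def search_sorted_index(matching_docs, max_hits):
    """matching_docs: list of (name, tfidf, boost) tuples that match the query."""
    # Assumption: the sorted index stores documents by descending boost.
    index_order = sorted(matching_docs, key=lambda d: d[2], reverse=True)
    # Stop scanning once max_hits matching documents have been collected.
    collected = index_order[:max_hits]
    # Rank only the collected hits by their full score.
    return sorted(collected, key=lambda d: d[1] * d[2], reverse=True)


docs = [("a", 1.5, 2.0), ("b", 1.0, 1.8), ("junk", 8.0, 0.5)]
top2 = [name for name, _, _ in search_sorted_index(docs, 2)]
full = [name for name, _, _ in search_sorted_index(docs, 3)]
print(top2, full)
```

In this toy data the low-boost "junk" document has the highest full score, yet it never reaches the result list when only two hits are collected, because the scan ends before reaching it.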