Doug Cutting wrote:
Andrzej Bialecki wrote:
Doug Cutting wrote:
The graph just shows that they differ, not how much better or worse
they are, since the baseline is not perfect. When the top-10 is 50%
different, are those 5 different hits markedly worse matches to your
eye than the five th
Andrzej Bialecki wrote:
Doug Cutting wrote:
The graph just shows that they differ, not how much better or worse
they are, since the baseline is not perfect. When the top-10 is 50%
different, are those 5 different hits markedly worse matches to your
eye than the five they've displaced, or are
Doug Cutting wrote:
Andrzej Bialecki wrote:
. How were the queries generated? From a log or randomly?
Queries were picked manually, to test the worst-performing
cases from a real query log.
So, for example, the 50% error rate might not be typical, but could be
worst-case.
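
To make the "50% different" figure concrete, here is a minimal sketch of how a top-10 difference could be computed. The function name, the document ids and the two hit lists are all made up for illustration; they are not the data behind the graph. As noted above, the number only says how much the two lists differ, not which list is better.

```python
def top_k_difference(baseline, candidate, k=10):
    """Fraction of the baseline top-k that is missing from the candidate top-k.

    Hypothetical helper: both arguments are ranked lists of document ids.
    """
    base = baseline[:k]
    if not base:
        return 0.0
    cand = set(candidate[:k])
    missing = [doc for doc in base if doc not in cand]
    return len(missing) / len(base)


# Invented hit lists: the first five hits agree, the last five do not.
baseline = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
candidate = ["d1", "d2", "d3", "d4", "d5", "x1", "x2", "x3", "x4", "x5"]
difference = top_k_difference(baseline, candidate)
```

Judging whether the five displaced hits are actually worse still needs a human look at the results; this measure cannot tell.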
Andrzej Bialecki wrote:
. How were the queries generated? From a log or randomly?
Queries were picked manually, to test the worst-performing cases
from a real query log.
So, for example, the 50% error rate might not be typical, but could be
worst-case.
. When results differed
Doug Cutting wrote:
Andrzej Bialecki wrote:
I tested it on a 5 mln index.
Thanks, this is great data!
Can you please tell us a bit more about the experiments? In particular:
. How were scores assigned to pages? Link analysis? log(number of
incoming links) or OPIC?
log()
. How were
Andrzej Bialecki wrote:
I tested it on a 5 mln index.
Thanks, this is great data!
Can you please tell us a bit more about the experiments? In particular:
. How were scores assigned to pages? Link analysis? log(number of
incoming links) or OPIC?
. How were the queries generated? From a log
Doug Cutting wrote:
Andrzej Bialecki wrote:
Ok, I just tested IndexSorter for now. It appears to work correctly,
at least I get exactly the same results, with the same scores and the
same explanations, if I run the same queries on the original and on
the sorted index.
Here's a more comple
Andrzej Bialecki wrote:
I'll test it soon - one comment, though. Currently you use a subclass of
RuntimeException to stop the collecting. I think we should come up with
a better mechanism - throwing exceptions is too costly.
I thought about this, but I could not see a simple way to achieve it.
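
The trade-off discussed here can be sketched outside of Lucene. The code below is not the posted collector and uses no Lucene API; the names StopCollecting, collect_with_exception and collect_with_flag are hypothetical, and a posting list is modelled as a plain sequence of (doc_id, score) pairs. The first variant aborts the scan by raising an exception from the callback. The second has the callback return a "keep going" flag, which only works when the scanning loop is written to check it, and that loop is not always under the caller's control.

```python
class StopCollecting(Exception):
    """Raised by the callback to abort the scan (exception-based approach)."""


def collect_with_exception(postings, n):
    """Collect the first n hits; the callback stops the scan by raising."""
    hits = []

    def collect(doc_id, score):
        if len(hits) >= n:
            raise StopCollecting()
        hits.append((doc_id, score))

    try:
        for doc_id, score in postings:
            collect(doc_id, score)
    except StopCollecting:
        pass
    return hits


def collect_with_flag(postings, n):
    """Collect the first n hits; the callback returns False to stop the scan."""
    hits = []
    if n <= 0:
        return hits

    def collect(doc_id, score):
        hits.append((doc_id, score))
        return len(hits) < n  # False tells the scanning loop to stop

    for doc_id, score in postings:
        if not collect(doc_id, score):
            break
    return hits


# Invented posting list of 100 documents with identical scores.
postings = [(doc_id, 1.0) for doc_id in range(100)]
first_three = collect_with_flag(postings, 3)
```

Both variants return the same hits; they differ only in how the stop signal travels from the callback back to the loop.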
Doug Cutting wrote:
Andrzej Bialecki wrote:
Ok, I just tested IndexSorter for now. It appears to work correctly,
at least I get exactly the same results, with the same scores and the
same explanations, if I run the same queries on the original and on
the sorted index.
Here's a more comple
Andrzej Bialecki wrote:
Ok, I just tested IndexSorter for now. It appears to work correctly, at
least I get exactly the same results, with the same scores and the same
explanations, if I run the same queries on the original and on the
sorted index.
Here's a more complete version, still mostly
Doug Cutting wrote:
Andrzej Bialecki wrote:
Shouldn't this be combined with a HitCollector that collects only the
first-n matches? Otherwise we still need to scan the whole posting
list...
Yes. I was just posting the work-in-progress.
Ok, I just tested IndexSorter for now. It appears t
Andrzej Bialecki wrote:
Shouldn't this be combined with a HitCollector that collects only the
first-n matches? Otherwise we still need to scan the whole posting list...
Yes. I was just posting the work-in-progress.
We will also need to estimate the total number of matches by
extrapolating li
Doug Cutting wrote:
Andrzej Bialecki wrote:
By all means please start, this is still near the limits of my
knowledge of Lucene... ;-)
Attached is a class which sorts a Nutch index by boost. I have only
tested it on a ~100 page index, where it appears to work correctly.
Please tell me how
Andrzej Bialecki wrote:
By all means please start, this is still near the limits of my knowledge
of Lucene... ;-)
Attached is a class which sorts a Nutch index by boost. I have only
tested it on a ~100 page index, where it appears to work correctly.
Please tell me how it works for you.
Dou
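
For readers without the attachment, the idea of sorting an index by boost can be illustrated with a toy in-memory model. This is not the attached class and does not touch Lucene or Nutch; the function name and the data layout (a list of per-document boosts and a dict of posting lists) are assumptions made for the sketch. Documents are renumbered so that higher boosts get lower ids, and every posting list is rewritten in the new id order.

```python
def sort_index_by_boost(boosts, postings):
    """Renumber documents by descending boost and remap the posting lists.

    boosts:   list where boosts[old_id] is the boost of document old_id
    postings: dict mapping a term to a list of old document ids
    Returns (new_boosts, new_postings, old_to_new).
    """
    # Stable sort: documents with equal boost keep their original order.
    order = sorted(range(len(boosts)), key=lambda old_id: -boosts[old_id])
    old_to_new = {old_id: new_id for new_id, old_id in enumerate(order)}
    new_boosts = [boosts[old_id] for old_id in order]
    new_postings = {
        term: sorted(old_to_new[doc] for doc in docs)
        for term, docs in postings.items()
    }
    return new_boosts, new_postings, old_to_new


# Invented four-document index.
boosts = [0.5, 2.0, 1.0, 3.0]
postings = {"nutch": [0, 2, 3], "lucene": [1, 2]}
new_boosts, new_postings, old_to_new = sort_index_by_boost(boosts, postings)
```

After sorting, the front of every posting list holds the highest-boost documents for that term, which is what makes stopping after the first n matches plausible.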
Andrzej Bialecki wrote:
By all means please start, this is still near the limits of my knowledge
of Lucene... ;-)
Okay, I'll try to get something working fairly soon.
Doug
Doug Cutting wrote:
Yes, this is why I was discouraged and stopped working on this.
However I am now hopeful that sorting the entire index by page score
and using top-1000 might work well with Nutch queries, since page
score is field-independent, and I think fields cause the problems.
Plus,
Andrzej Bialecki wrote:
For single term queries (in Nutch - in Lucene they are rewritten to
complex BooleanQueries), the hit lists are nearly identical for the
first 10 hits, then they start to differ more and more as you progress
along the original hit list. This is not so surprising - after a
Doug Cutting wrote:
The IndexOptimizer.java class in the searcher package was an old
attempt to create something like what Suel calls "fancy postings". It
creates an index with the top 10% scoring postings. Since documents
are not renumbered one can intermix postings from this with the full