Add a scoring DistanceQuery that does not need caches and separate filters
Uwe Schindler updated LUCENE-2395:
--
Attachment: (was: DistanceQuery.java)
Uwe Schindler updated LUCENE-2395:
--
Attachment: DistanceQuery.java
Small updates to Chris' patches.
...classes are missing (coming with Chris' later patches), but it shows how it should work and how customizable it is.
...thought about the broken distance query in contrib. It lacks the following features:
- It needs a query/filter for the enclosing bbox (which is constant score)
- It needs a separate filter for filtering out hits too far away (inside the bbox but outside the distance limit)
- It has no scoring, so if somebody...
...addresses the current problems with caching calculated distances and means that Spatial will work per segment.
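To make the scoring idea concrete, here is a minimal, self-contained sketch (my illustration, not the attached DistanceQuery.java; all names are made up): compute the great-circle distance from the query point per document and map it into (0, 1] so that closer hits score higher.

// Illustrative sketch only: maps a great-circle distance to a score in (0, 1].
// Not taken from the attached patch; class and method names are hypothetical.
public final class DistanceScoreSketch {

  private static final double EARTH_RADIUS_KM = 6371.0;

  /** Haversine great-circle distance in kilometers. */
  static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
        + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
          * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
  }

  /** 1.0 at the query point, 0.5 at maxDistanceKm; closer documents score higher. */
  static float distanceScore(double distanceKm, double maxDistanceKm) {
    return (float) (1.0 / (1.0 + distanceKm / maxDistanceKm));
  }

  public static void main(String[] args) {
    double d = haversineKm(52.52, 13.405, 48.137, 11.575); // Berlin -> Munich
    System.out.println(d + " km, score = " + distanceScore(d, 100.0));
  }
}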
Add a scoring DistanceQuery that does not need caches and separate filters
--
Key: LUCENE-2395
URL: https://issues.apache.org/jira/browse/LUCENE-2395
Project: Lucene - Java
...why have it ordered at all?
> Enable flexible scoring
> ---
>
> Key: LUCENE-2392
> URL: https://issues.apache.org/jira/browse/LUCENE-2392
> Project: Lucene - Java
> Issue Type: Improvement
>
...above. I misunderstood that the stats I
need are stored per-field per-doc. So that will allow me to compute the
docLength as I want.
...the "baby steps" part of the original thread). Ie, the IR world seems to have converged on a smallish set of "stats" that are commonly required, so I'd like to make those initial stats work well, for starters. Commit that (it enables all sorts of state-of-the-art scoring models...)
...(~171 words per doc on avg).
...the docLength as one perceives it. Why is that problematic?
What Mike opened is an issue titled "enable flexible scoring" ... what I'm asking for falls under that hood?
Also, maybe we should have that discussion on the issue?
Shai
On Mon, Apr 12, 2010 at 11:31 AM, Robert Muir wrote:
> ...how that length is computed. Wherever we write the norms, we'll
> call that impl, which by default will do what Lucene does today?
> I think though that it's not a field-level setting, but an IW one?
>
...how that length is computed. Wherever we write the norms, we'll call that impl, which by default will do what Lucene does today?
I think though that it's not a field-level setting, but an IW one?
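To make the above concrete, here is a rough sketch against the 3.x Similarity extension point that already exists (treat exact signatures as approximate for your Lucene version): the norm is computed per field at index time by whatever Similarity the IndexWriter is configured with, so "how that length is computed" is pluggable there.

// Sketch against the Lucene 3.x Similarity API (signatures approximate).
// A custom definition of "length", installed on the IndexWriter, is what gets
// called wherever the norms are written.
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;

public class PivotedLengthSimilarity extends DefaultSimilarity {

  /** Called at index time for every (doc, field); the result is encoded into the norm byte. */
  @Override
  public float computeNorm(String field, FieldInvertState state) {
    // Decide here what "length" means, e.g. whether overlap tokens
    // (position increment 0) are counted or discounted.
    int length = state.getLength() - state.getNumOverlap();
    return state.getBoost() * lengthNorm(field, length);
  }
}

Usage would then be writer-level rather than field-level, matching the point above, e.g. writer.setSimilarity(new PivotedLengthSimilarity()).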
...and the discountOverlaps=false setting (no longer the default) should be considered deprecated compatibility behavior :)
Enable flexible scoring
---
Key: LUCENE-2392
URL: https://issues.apache.org/jira/browse/LUCENE-2392
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Michael McCandless
> "flexible matching", which is more expansive than "flexible scoring"?
>>
>> I think so. Maybe it shouldn't be called a Similarity (which to me
>> (though, carrying a heavy curse of knowledge burden...) means
>> "scoring")? Matcher?
> ...the way a field is tokenized is part of its field definition, thus
> the Analyzer is part of the field definition, thus the Analyzer is part of the
> schema and needs to be stored with the index.
OK.
> Still, we support different Analyzers at search time by way of QueryParser.
> QueryParser...
On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
> > Maybe aggressive automatic data-reduction makes more sense in the context of
> > "flexible matching", which is more expansive than "flexible scoring"?
>
> I think so. Maybe it shouldn't...
...Still, we support different Analyzers at search time by way of QueryParser.
QueryParser's constructor requires a Schema, but also accepts an optional Analyzer which, if supplied, will be used instead of the Analyzers from the Schema.
> > Maybe aggressive automatic data-reduction makes more sense in the context of...
>> Ie so the chosen Sim can properly recompute all boost bytes (if it uses
>> those), for scoring models that "pivot" based on avg's of these stats?
>
> Yes, we could support that.
>
> It's not high on my todo-list for core Lucy, though: poor payoff for...
...back!
The change you suggested works; now it compiles without problems (I used Lucene 3.0.1).
Best regards,
Katja
> Add BM25 Scoring to Lucene
> --
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
>
> ...Ie so the chosen Sim can properly recompute all boost bytes (if it uses
> those), for scoring models that "pivot" based on avg's of these stats?
Yes, we could support that.
It's not high on my todo-list for core Lucy, though: poor payoff for all the complexity it would introduce, ...
[javac] float fieldNorm = this.getSimilarity().decodeNormValue(norms[i][this.docID()]);
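If the compile problem above comes from building the BM25 patch against Lucene 3.0.1, one plausible cause (an assumption on my part, since the error text is cut off) is that the instance method decodeNormValue(byte) does not exist in that release; the static decoder uses the same norm table and would be the drop-in replacement:

// Assumption: decodeNormValue(byte) is unavailable on Lucene 3.0.1, so fall
// back to the static decoder, which reads the same norm table.
// float fieldNorm = this.getSimilarity().decodeNormValue(norms[i][this.docID()]);
float fieldNorm = org.apache.lucene.search.Similarity.decodeNorm(norms[i][this.docID()]);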
On Mon, Mar 15, 2010 at 7:49 PM, Marvin Humphrey wrote:
> On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote:
>> I mean specifically one should not have to commit to the precise
>> scoring model they will use for a given field, when they index that
>> field.
On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote:
> I mean specifically one should not have to commit to the precise
> scoring model they will use for a given field, when they index that
> field.
Yeah, I've never seen committing to a precise scoring model at index time...
>>> But I don't like baking in search concepts at index time...
>>
> Many scoring models are possible if you store enough stats in the
> index.
>
In general the missing stats seem to fit in two buckets/categories:
1) length normalization pivot: average length in...
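For readers following the thread, the "length normalization pivot" is the index-wide average field length that models like BM25 and pivoted normalization divide by. In textbook form (not code from this thread) the BM25 per-term contribution is:

// Standard BM25 per-term contribution (textbook form, not the patch's code).
// fieldLength and avgFieldLength are the stats discussed above; k1 ~ 1.2, b ~ 0.75.
static float bm25TermScore(float idf, float tf, float fieldLength,
                           float avgFieldLength, float k1, float b) {
  float norm = k1 * (1 - b + b * (fieldLength / avgFieldLength));
  return idf * (tf * (k1 + 1)) / (tf + norm);
}

Without avgFieldLength stored (or computable) somewhere, the normalization term cannot be evaluated, which is exactly the missing stat being discussed.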
> ...concepts baked in, and a flat file would be best. :)
>
> Seriously... optimizing on-disk data structures to accommodate anticipated
> search query patterns and maximize speed and relevance... that's what
> indexing's all about, ain't it?
You're over-reading into...
...would be best. :)
Seriously... optimizing on-disk data structures to accommodate anticipated search query patterns and maximize speed and relevance... that's what indexing's all about, ain't it?
And what class other than Similarity knows enough about the scoring algorithm to perform these data reductions...
...ty judgments.
>> > However, that polymorphism would be handled internally -- it wouldn't be the
>> > responsibility of the user to determine whether a codec supported a particular
>> > scoring model.
>>
>> Is that "yes"...
On Thu, Mar 11, 2010 at 12:35 PM, Marvin Humphrey wrote:
> On Mon, Mar 08, 2010 at 02:10:35PM -0500, Michael McCandless wrote:
>
>> We ask it to give us a Codec.
>
> There's a conflict between the segment-wide role of the "Codec" class and its
> role as specifier for posting format.
>
> In some sense...
> > ...handled internally -- it wouldn't be the
> > responsibility of the user to determine whether a codec supported a particular
> > scoring model.
>
> Is that "yes" (a user can do MatchOnlySim at search time if the field
> were indexed with B25Sim)?
In essence...
On Mon, Mar 08, 2010 at 02:10:35PM -0500, Michael McCandless wrote:
> We ask it to give us a Codec.
There's a conflict between the segment-wide role of the "Codec" class and its
role as specifier for posting format.
In some sense, you could argue that the "codec" reads/writes the entire index segment...
>> >> ...store all stats for a given field during indexing, but then sometimes
>> >> use match-only and sometimes full-scoring when querying against that
>> >> field?
>> >
>> > The same way that Lucene knows that sometimes it needs a docs-only-enum and
>> > sometimes it needs a docs-and-positions enum.
On Tue, Mar 09, 2010 at 01:18:12PM -0500, Michael McCandless wrote:
>
> >> You said "of course" before but... how in your proposal could one
> >> store all stats for a given field during indexing, but then sometimes
> >> use match-only and sometimes full-scoring when querying against that field?
..., it wasn't enforced.
OK.
>> You said "of course" before but... how in your proposal could one
>> store all stats for a given field during indexing, but then sometimes
>> use match-only and sometimes full-scoring when querying against that
>> field?
>
> The same way...
> store all stats for a given field during indexing, but then sometimes
> use match-only and sometimes full-scoring when querying against that
> field?
The same way that Lucene knows that sometimes it needs a docs-only-enum and
sometimes it needs a docs-and-positions enum. Sometimes you need...
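For illustration, the analogy reads roughly like this in code against the 4.x flex API (the flex branch under discussion here is older, so treat the exact signatures as approximate):

// The consumer asks for only as much of the postings as it needs.
import java.io.IOException;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

class EnumChoiceSketch {
  static void demo(IndexReader reader) throws IOException {
    Terms terms = MultiFields.getTerms(reader, "body");
    if (terms == null) return;
    TermsEnum te = terms.iterator(null);
    if (te.seekExact(new BytesRef("lucene"))) {
      // Match-only consumer: the cheap docs-only enum is enough.
      DocsEnum docs = te.docs(MultiFields.getLiveDocs(reader), null);
      // Positional consumer (e.g. exact phrase scoring): ask for the richer
      // enum; this returns null if positions were not indexed for the field.
      DocsAndPositionsEnum postings =
          te.docsAndPositions(MultiFields.getLiveDocs(reader), null);
    }
  }
}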
...approximately the same amount of work no matter how you time-shift it.
Yes.
>> I do agree there's some connection -- if I don't store tf nor
>> positions then I can't use a Sim that needs these stats.
>>
>> > I also like the idea of novice/intermediate users...
...then I can't use a Sim that needs these stats.
>
> > I also like the idea of novice/intermediate users being able to express the
> > intent for how a field gets scored by choosing a Similarity subclass, without
> > having to worry about the underlying details of posting format.
>
> Well.. I...
On 03/08/2010 at 1:57 PM, Steven A Rowe wrote:
> On 03/08/2010 at 1:13 PM, Michael McCandless wrote:
> > On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey wrote:
> > > On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:
> > > > > What's the flex API for specifying a custom posting format?
On 03/08/2010 at 1:13 PM, Michael McCandless wrote:
> On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey wrote:
> > On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:
> > > > What's the flex API for specifying a custom posting format?
> > >
> > > You implement a Codecs class, which...
>> ...be the class Lucene uses to get reader/writer for other parts of the
>> index.
>
> Huh? What does the posting format specifier have to do with e.g. stored
> fields?
>
> What you're describing sounds more like the Architecture class in KinoSearch.
OK.
>> I'...
> I'm a little confused: if I indexed a field with full postings data,
> shouldn't I still be allowed to score with match-only scoring?
Of course.
> When a movie is encoded to a file, the codec(s) determine all sorts of
> interesting details. Then when you watch the movie you...
...3.0.1, please let me know and I will try to handle the changes.
> ...guaranteed to have a consistently random distribution of field lengths across
> nodes.
>
> Hoss had a good example illustrating why per-node IDF doesn't always work well
> in a cluster: search cluster of news content with nodes divided by year, and
> the top scoring hit for "iphone"...
...Weight are all changed. Does anyone have a modified
version of the BM25 classes which works with the latest version of Lucene?
...illustrating why per-node IDF doesn't always work well
in a cluster: search cluster of news content with nodes divided by year, and
the top scoring hit for "iphone" is a misspelling from 1997 (because it was an
extremely rare term on that search node).
Similarly, if you calc field length stats...
On Tue, Mar 2, 2010 at 4:12 PM, Marvin Humphrey wrote:
> On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote:
>> The problem is, these scoring models need the avg field length (in
>> tokens) across the entire index, to compute the norms.
>>
>> Ie, you can't do that on writing a single segment.
On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote:
> The problem is, these scoring models need the avg field length (in
> tokens) across the entire index, to compute the norms.
>
> Ie, you can't do that on writing a single segment.
I don't see why...
...search time.
> Even in Lucene, it seems odd to want to calculate all of those on
> the fly each time you open an index. It seems to me that this is a
> specialized need of BM25.
The problem is, these scoring models need the avg field length (in
tokens) across the entire index, to compute the norms.
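One way to approximate that index-wide average at search time, without changing the index format, is to decode the norms that are already there: with DefaultSimilarity the norm is roughly 1/sqrt(length), so length is roughly 1/norm^2. A hedged 3.x-style sketch (lossy, because norms are quantized to a byte, and only meaningful if no index-time boosts were used):

// Approximate average field length from existing norms (Lucene 3.x APIs).
// Assumes DefaultSimilarity-style norms (1/sqrt(numTokens)) and no boosts;
// the single-byte encoding makes this an estimate only.
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

class AvgFieldLength {
  static double estimate(IndexReader reader, String field) throws IOException {
    byte[] norms = reader.norms(field);
    if (norms == null) return 0; // norms omitted for this field
    double totalTokens = 0;
    int counted = 0;
    for (int doc = 0; doc < reader.maxDoc(); doc++) {
      if (reader.isDeleted(doc)) continue;
      float norm = Similarity.decodeNorm(norms[doc]); // ~ 1/sqrt(length)
      if (norm > 0) {
        totalTokens += 1.0 / (norm * norm);
        counted++;
      }
    }
    return counted == 0 ? 0 : totalTokens / counted;
  }
}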
...Similarity is where we decode norms right now. In my opinion, it should be
the Similarity object from which we specify per-field posting formats.
See my reply to Robert in the BM25 thread:
http://markmail.org/message/77rmrfmpatxd3p2e
That way, custom scoring implementations can guarantee that...
In thinking about & discussing with Robert how to allow Lucene to
support other scoring models, eg lnu.ltc, BM25, etc, I think a
relatively contained set of changes can give us a solid step forward.
Something like this:
* Store additional per-doc stats in the index, eg in a custom...
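Purely as a reading aid (the proposal text is cut off here, so this is my grouping, not Mike's), the stats the models in this thread keep coming back to split into per-doc values written at index time and index-wide aggregates needed at search time; the names below are hypothetical:

// Hypothetical grouping of the stats discussed in this thread; not a real Lucene class.
class FieldScoringStats {
  // Per doc, per field (written alongside the norm today):
  int numTokens;          // field length in tokens
  int numUniqueTerms;     // distinct terms in this field for this doc

  // Index-wide, per field (the "pivot"-style aggregates):
  long sumOfFieldLengths; // enables average field length
  int docCountWithField;  // how many docs actually have the field
}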
...or strictly positive to avoid terms being ignored at all.
{quote}
...stopwords list is used. I'm curious what you think
about this, as it looks like a potential improvement for people not using
stopwords (multilingual situation, etc).
> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
...factors we have to hand - the IDF
of the user's supposedly valid input and the similarity measure of each variant
compared to the input.
We could get fancy about probability of variants given the other input terms in
the query, but that feels like it's straying into spell checker territory and ngrams etc.
...these two freqs bring some
easy precision points (HF-LF pairs are much more likely to be typos than two
HF-HF... ).
...isn't as simple as offering a choice between
preserving IDF for all terms or not.
Mark, right, my mistake. I will move this patch to LUCENE-124 so there is a
simple alternative; you can proceed here with a smarter method... sorry I got
confused amongst the different issues :)
Robert Muir updated LUCENE-329:
---
Attachment: (was: LUCENE-329.patch)
...isn't as simple as offering a choice between
preserving IDF for all terms or not.
Instead, it is a proposal that we should use the *input* term's IDF for scoring
all variants of the same root term (or taking an average of variants where the
root term does not exist).
This I feel p...
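A minimal sketch of that weighting idea (my illustration, not the attached patch; names are hypothetical): fix the IDF to the input term's and let each variant contribute in proportion to its edit-distance similarity to the input.

// Hypothetical illustration of "use the input term's IDF for all variants".
// idfOfInput comes from the docFreq of the user's original term; editSimilarity
// is a FuzzyQuery-style similarity in [0, 1], 1 meaning an exact match.
class FuzzyVariantWeighting {
  static float variantBoost(float idfOfInput, float editSimilarity) {
    // Every variant shares the input term's IDF, so a rare misspelling no
    // longer outranks the intended term purely because of its own huge IDF.
    return idfOfInput * editSimilarity;
  }
}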
Uwe Schindler reassigned LUCENE-329:
Assignee: (was: Lucene Developers)
...You can still create a
'smarter' method here; it won't get in the way, as now FuzzyQuery does not have
a hardcoded rewrite method.
Robert Muir updated LUCENE-329:
---
Attachment: LUCENE-329.patch
here is a rough patch
...would lose this feature (available in FuzzyLikeThisQuery)
Mark, it wouldn't lose any features. We simply provide another option, just
like we do for other MultiTermQuery rewrites for other queries, so users can
choose what they want to use. It's just an additional choice.
...nice way.
We can make an alternative rewrite method for fuzzy that works just like
TopTermsRewrite, except it creates a BooleanQuery of ConstantScore queries
instead. This way the score will be equal to the boost.
Then users could choose which one they want to use.
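For illustration, choosing such a rewrite from user code would look roughly like this; the rewrite class name below is the one the later patches end up using, so treat it as an assumption in the context of this comment:

// Sketch: per-query choice of rewrite, so fuzzy variant scores come from the
// boost (edit-distance similarity) instead of each variant's own IDF/TF.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.MultiTermQuery;

class BoostOnlyFuzzyExample {
  static FuzzyQuery build() {
    FuzzyQuery fuzzy = new FuzzyQuery(new Term("title", "lucene"));
    // Rewrites to a BooleanQuery of constant-score clauses; each clause's
    // score equals its boost. Keeps only the 50 best terms.
    fuzzy.setRewriteMethod(
        new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(50));
    return fuzzy;
  }
}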
Similarity can only be set per index, but I may want to adjust scoring
behaviour at a field level
-
Key: LUCENE-2236
URL: https://issues.apache.org/jira/browse/LUCENE-2236
Mark Miller updated LUCENE-2130:
Fix Version/s: Flex Branch
Mark Miller updated LUCENE-2130:
Attachment: LUCENE-2130.patch
updated
...subreader and grab the actual ConstantScore weight. It works, I think - but it's
a little ugly.
I've spewed too much confusion in this issue - just going to rewrite the
summary.
...mode does still use the BooleanQuery (of course, why else have it) - but it's
only going to be with few clauses, so neither is really a benefit.)
...summary - you wouldn't apply a huge boolean
query - you'd just have a sparser filter. This might not be that beneficial.
* edit *
Smaller, sparser filter?
...they frequently scan the entire term dictionary only to return a few results.
6 AM:
--
The ugly patch - (which doesn't yet handle the filter supplied case)
was (Author: markrmil...@gmail.com):
The ugly patch
Mark Miller updated LUCENE-2130:
Attachment: LUCENE-2130.patch
The ugly patch
...the advantage when you are enumerating a
lot of terms is that you avoid DirectoryReader's MultiTermEnum and its PQ.
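As background for the excerpts above (a sketch of the general shape, not the attached patch): a Filter's getDocIdSet is invoked once per (sub)reader, so enumerating the matching terms against that reader's own term dictionary is what sidesteps DirectoryReader's MultiTermEnum and its priority queue. APIs below are Lucene 3.x-style and the class name is made up.

// Per-segment term enumeration into a bitset; illustrative only.
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.OpenBitSet;

class PrefixBitsetFilter extends Filter {
  private final Term prefix;
  PrefixBitsetFilter(Term prefix) { this.prefix = prefix; }

  @Override
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    TermEnum terms = reader.terms(prefix); // this (sub)reader's own term dictionary
    TermDocs docs = reader.termDocs();
    try {
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals(prefix.field())
            || !t.text().startsWith(prefix.text())) {
          break;
        }
        docs.seek(terms);
        while (docs.next()) {
          bits.set(docs.doc());
        }
      } while (terms.next());
    } finally {
      terms.close();
      docs.close();
    }
    return bits;
  }
}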
Investigate Rewriting Constant Scoring MultiTermQueries per segment
---
Key: LUCENE-2130
URL: https://issues.apache.org/jira/browse/LUCENE-2130
Project: Lucene - Java
Issue
...one can try.
...I tried modifying length normalization with
SweetSpot etc. as others have done in the past. For this corpus I was unable to
improve it in this way.
Yeah, can't speak for SweetSpot, but there are other approaches too that don't
favor shorter docs all the time.
...someone else has also donated an implementation of the
Axiomatic Retr. Function.
I've never been able to get that scoring function to do anything more than be
consistently worse than the default Lucene formula. I tried at least 3 test
collections with it...
bq. I'm also curious if anyone has compared BM25...
...yet, but...
Should we take just a small step back and consider what it would take to
actually make scoring more pluggable, instead of just thinking about how best to
integrate BM25? In other words, someone else has also donated an
implementation of the Axiomatic Retr. Function. Much like BM25...
...instead make a dedicated posting list, which would
be properly merged, but we'd then have to re-walk to compute the stats for the
newly merged segment.
...query type), as far
as frequency and docFreq of the phrase/terms are available.
At this point it is not supported in the patch, but I don't see any reason why
it couldn't be implemented; moreover, I don't really know how to do it
:-).
...b:x2 TermWeight will calculate the IDF for Term(a, x1)
and Term(b, x2); am I missing something?
...document level IDF)
Is there anything else?
bq. Only simple boolean queries based on terms are supported (with operators
or, and, not). For instance it does not support PhraseQuery.
This is concerning -- is there no way to score a PhraseQuery in BM25F?