[jira] [Commented] (LUCENE-8633) Remove term weighting from interval scoring

2019-01-16 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744076#comment-16744076
 ] 

ASF subversion and git services commented on LUCENE-8633:
---------------------------------------------------------

Commit 18c3986cc4bdb38949fae14bd771621345d41776 in lucene-solr's branch 
refs/heads/branch_8x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=18c3986 ]

LUCENE-8633: Remove term weighting from IntervalQuery scores


> Remove term weighting from interval scoring
> -------------------------------------------
>
> Key: LUCENE-8633
> URL: https://issues.apache.org/jira/browse/LUCENE-8633
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Attachments: LUCENE-8633.patch, LUCENE-8633.patch
>
>
> IntervalScorer currently uses the same scoring mechanism as SpanScorer, 
> summing the IDF of all possibly matching terms from its parent 
> IntervalsSource and using that in conjunction with a sloppy frequency to 
> produce a similarity-based score.  This doesn't really make sense, however, 
> as it means that terms that don't appear in a document can still contribute 
> to the score, and appears to make scores from interval queries comparable 
> with scores from term or phrase queries when they really aren't.
> I'd like to explore a different scoring mechanism for intervals, based purely 
> on sloppy frequency and ignoring term weighting.  This should make the scores 
> easier to reason about, as well as making them useful for things like 
> proximity boosting on boolean queries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8633) Remove term weighting from interval scoring

2019-01-16 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744077#comment-16744077
 ] 

ASF subversion and git services commented on LUCENE-8633:
---------------------------------------------------------

Commit a8266492416dbc2fcf665983d1cac141e50fa9eb in lucene-solr's branch 
refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a826649 ]

LUCENE-8633: Remove term weighting from IntervalQuery scores





[jira] [Commented] (LUCENE-8633) Remove term weighting from interval scoring

2019-01-16 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743886#comment-16743886
 ] 

Jim Ferenczi commented on LUCENE-8633:
--------------------------------------

Thanks Alan. +1, the patch looks good.




[jira] [Commented] (LUCENE-8633) Remove term weighting from interval scoring

2019-01-14 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16741936#comment-16741936
 ] 

Alan Woodward commented on LUCENE-8633:
---------------------------------------

I added a configurable pivot and exponent.  By default we use a saturation
function with a pivot of 1, but you can tune the pivot, and supplying an
exponent switches to a sigmoid function instead (a sigmoid with an exponent of
1 is equivalent to the saturation function).  I looked at adding the log
function as well, but a) it isn't bounded in the same way, and b) it would
expose IntervalScoreFunction as a public interface, which I'd rather not do.
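
The relationship between the two functions can be checked numerically.  The
sketch below is not the code from the patch: it assumes the sigmoid has the
form freq^a / (freq^a + pivot^a), as in Lucene's feature queries, and it
assumes the log function has the form log(scaling + freq).  All function and
parameter names are made up for illustration.

```python
import math


def saturation(freq, pivot=1.0):
    """Bounded in [0, 1); equals 0.5 when freq == pivot."""
    return freq / (freq + pivot)


def sigmoid(freq, pivot=1.0, exponent=1.0):
    """Assumed form: freq^a / (freq^a + pivot^a); also bounded in [0, 1)."""
    fa = freq ** exponent
    return fa / (fa + pivot ** exponent)


def log_score(freq, scaling=1.0):
    """Assumed form of the log function; grows without bound as freq grows."""
    return math.log(scaling + freq)


# With an exponent of 1 the sigmoid collapses to the saturation function.
for freq in (0.5, 1.0, 2.0, 10.0):
    assert abs(sigmoid(freq, 1.0, 1.0) - saturation(freq, 1.0)) < 1e-12

# A larger exponent sharpens the curve around the pivot but keeps the bound.
print(sigmoid(0.5, pivot=1.0, exponent=3.0))
print(sigmoid(2.0, pivot=1.0, exponent=3.0))

# The log function is not capped at 1, unlike the other two.
print(log_score(1e6) > 1.0)
```

Under these assumptions both bounded functions pass through 0.5 at the pivot,
and the exponent only controls how sharply the score rises around it.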




[jira] [Commented] (LUCENE-8633) Remove term weighting from interval scoring

2019-01-11 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740469#comment-16740469
 ] 

Jim Ferenczi commented on LUCENE-8633:
--------------------------------------

+1 to removing the term statistics and relying solely on the number and extent
of the intervals.  Choosing the pivot is really difficult though, and it cannot
be computed statistically the way the feature query does.  Maybe we should have
a default pivot of 1 and make it configurable in the constructor?  We could
also make all feature functions available?




[jira] [Commented] (LUCENE-8633) Remove term weighting from interval scoring

2019-01-11 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740265#comment-16740265
 ] 

Alan Woodward commented on LUCENE-8633:
---------------------------------------

Attached is a patch with an alternative scoring system:
* Sloppy frequency is calculated as the sum of individual interval scores.
Each interval is scored as 1/(length - minExtent + 1), where minExtent() is a
new method on IntervalsSource that exposes the minimum possible length of an
interval produced by that source.  This is based on the scoring mechanism
described in Vigna's paper on intervals [1].
* In order to keep the score bounded, so that it can be used as a proximity
boost without wrecking max-score optimizations, the sloppy frequency is
converted to a score using a saturation function.  I've chosen 5 as the pivot
here more or less at random (meaning that a document containing 5 intervals of
the minimum possible length will get a score of boost * 0.5); better ways of
choosing a pivot are welcome.

[1] 
http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf
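
As a rough illustration of the arithmetic in the two bullets above (not the
code from the patch), the sketch below scores each interval as
1/(length - minExtent + 1), sums those scores into a sloppy frequency, and
passes the sum through a saturation function freq / (freq + pivot).  The
function names and the example interval lengths are hypothetical; only the
formulas and the pivot of 5 come from the comment.

```python
def interval_score(length, min_extent):
    """Score one interval: 1.0 at the minimum possible length, less as it widens."""
    if length < min_extent:
        raise ValueError("an interval cannot be shorter than min_extent")
    return 1.0 / (length - min_extent + 1)


def sloppy_freq(lengths, min_extent):
    """Sum of the individual interval scores for one document."""
    return sum(interval_score(n, min_extent) for n in lengths)


def saturation(freq, pivot):
    """Map an unbounded frequency into [0, 1); equals 0.5 when freq == pivot."""
    return freq / (freq + pivot)


def score(lengths, min_extent, pivot=5.0, boost=1.0):
    """Bounded document score: boost * saturation(sloppy frequency)."""
    return boost * saturation(sloppy_freq(lengths, min_extent), pivot)


# Five intervals of the minimum possible length give a sloppy frequency of 5,
# so with a pivot of 5 the score is boost * 0.5.
print(score([2, 2, 2, 2, 2], min_extent=2))  # 0.5

# Wider intervals contribute less, so this document scores lower.
print(score([2, 4, 7], min_extent=2))
```

Because the saturation function never reaches 1, the score stays below the
boost however many intervals a document contains, which is the boundedness the
second bullet relies on.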
