[jira] [Commented] (LUCENE-6818) Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr

2016-01-19 Thread Ahmet Arslan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106999#comment-15106999
 ] 

Ahmet Arslan commented on LUCENE-6818:
--

Thanks [~rcmuir] for taking care of this.

bq. For the solr factory changes around discountOverlaps, can you make a 
separate issue for that?
Created SOLR-8570

> Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr
> --
>
> Key: LUCENE-6818
> URL: https://issues.apache.org/jira/browse/LUCENE-6818
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/query/scoring
>Affects Versions: 5.3
>Reporter: Ahmet Arslan
>Assignee: Robert Muir
>Priority: Minor
>  Labels: similarity
> Fix For: 5.5, Trunk
>
> Attachments: LUCENE-6818.patch, LUCENE-6818.patch, LUCENE-6818.patch, 
> LUCENE-6818.patch, LUCENE-6818.patch
>
>
> As explained in the 
> [write-up|http://lucidworks.com/blog/flexible-ranking-in-lucene-4], many 
> state-of-the-art ranking model implementations have been added to Apache 
> Lucene. This issue aims to include the DFI model, which is the non-parametric 
> counterpart of the Divergence from Randomness (DFR) framework.
> DFI is both parameter-free and non-parametric:
> * parameter-free: it does not require any parameter tuning or training.
> * non-parametric: it does not make any assumptions about word frequency 
> distributions on document collections.
> It is highly recommended *not* to remove stopwords (very common terms: the, 
> of, and, to, a, in, for, is, on, that, etc.) with this similarity.
> For more information see: [A nonparametric term weighting method for 
> information retrieval based on measuring the divergence from 
> independence|http://dx.doi.org/10.1007/s10791-013-9225-4]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6818) Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr

2016-01-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105012#comment-15105012
 ] 

Robert Muir commented on LUCENE-6818:
-

Thanks [~iorixxx] !

The norms/spans tests were added in LUCENE-6896.

Rather than a wildcard import, I moved RandomSimilarityProvider to 
similarities/RandomSimilarity, so it's in the correct package. It's just used by 
LuceneTestCase.newSearcher.

I ran the test suite a few times to try to find any problems, and did some 
rudimentary relevance testing of the lucene impl and everything seems ok.

For the solr factory changes around discountOverlaps, can you make a separate 
issue for that? I'm concerned that, if the factory is not initialized properly, 
instead there will be other problems, so maybe that should really be an 
assertion or something.





[jira] [Commented] (LUCENE-6818) Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr

2016-01-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105022#comment-15105022
 ] 

ASF subversion and git services commented on LUCENE-6818:
-

Commit 1725210 from [~rcmuir] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1725210 ]

LUCENE-6818: Add DFISimilarity implementing the divergence from independence 
model




[jira] [Commented] (LUCENE-6818) Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr

2016-01-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15104996#comment-15104996
 ] 

ASF subversion and git services commented on LUCENE-6818:
-

Commit 1725205 from [~rcmuir] in branch 'dev/trunk'
[ https://svn.apache.org/r1725205 ]

LUCENE-6818: Add DFISimilarity implementing the divergence from independence 
model




[jira] [Commented] (LUCENE-6818) Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr

2015-09-29 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934878#comment-14934878
 ] 

Adrien Grand commented on LUCENE-6818:
--

bq. I think we should simply recommend to use index time boosts only with 
ClassicSimilarity.

If we only recommend using index-time boosts with a Similarity that is not 
even the default one, maybe we should remove index-time boosts entirely? I 
opened https://issues.apache.org/jira/browse/LUCENE-6819





[jira] [Commented] (LUCENE-6818) Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr

2015-09-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933478#comment-14933478
 ] 

Robert Muir commented on LUCENE-6818:
-

It is not a bug; it is just how the index-time boost in Lucene has always 
worked. Boosting a document at index-time is just a way for a user to make it 
artificially longer or shorter.

I don't think we should change this. It makes it much easier for people to 
experiment, since all of our scoring models do this the same way. It means you 
do not have to reindex to change the Similarity, for example.

It's easy to understand this as "at search time, the similarity sees the 
'normalized' document length". All I am saying is, these scoring models just 
have to make sure they don't do something totally nuts (like return negative, 
Infinity, or NaN scores) if the user index-time boosts with extreme values: 
extreme values that might not make sense relative to e.g. the collection-level 
statistics for the field. So in my opinion, all that is needed is to add a 
`testCrazyBoosts` that looks a lot like `testCrazySpans`, and just asserts 
those things, ideally across all 256 possible norm values.
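
To make the idea concrete, here is a minimal, self-contained sketch of such a sanity sweep. It is not the test from the patch and does not use Lucene classes: `hypotheticalLength` is a made-up stand-in for decoding a one-byte norm into a document length (Lucene's real norm encoding is different), and the scoring function is an assumed DFI-style formula with the smoothed `expected` value discussed elsewhere in this thread.

```java
// Illustrative sketch only: not the testCrazyBoosts from Lucene.
class CrazyNormSweepSketch {

    static float log2(double x) {
        return (float) (Math.log(x) / Math.log(2.0));
    }

    // Assumed DFI-style score using the smoothed expected frequency.
    static float score(long totalTermFreq, long numberOfFieldTokens,
                       float freq, float docLen) {
        final float expected =
            (1 + totalTermFreq) * docLen / (1 + numberOfFieldTokens);
        if (freq <= expected) {
            return 0f; // term occurs no more often than chance predicts
        }
        final float measure = (freq - expected) * (freq - expected) / expected;
        return log2(measure + 1);
    }

    // HYPOTHETICAL decoding of a byte norm into a document length.
    // It only spreads 256 values from very small to very large lengths.
    static float hypotheticalLength(int norm) {
        return (float) Math.pow(2.0, (norm - 80) / 4.0);
    }

    // A score is "sane" if it is not negative, infinite, or NaN.
    static boolean isSane(float s) {
        return !Float.isNaN(s) && !Float.isInfinite(s) && s >= 0f;
    }

    // Sweeps all 256 norm values and counts the scores that are not sane.
    static int countInsaneScores() {
        final long[] totalTermFreqs = {0L, 50L};
        final float[] freqs = {1f, 5f, 1000f};
        int insane = 0;
        for (int norm = 0; norm < 256; norm++) {
            final float docLen = hypotheticalLength(norm);
            for (long ttf : totalTermFreqs) {
                for (float freq : freqs) {
                    if (!isSane(score(ttf, 10000L, freq, docLen))) {
                        insane++;
                    }
                }
            }
        }
        return insane;
    }

    public static void main(String[] args) {
        System.out.println("insane scores: " + countInsaneScores());
    }
}
```

The point of the sweep is that the document lengths are deliberately unrelated to the collection statistics, which is what an extreme index-time boost would produce.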




[jira] [Commented] (LUCENE-6818) Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr

2015-09-28 Thread Ahmet Arslan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14910174#comment-14910174
 ] 

Ahmet Arslan commented on LUCENE-6818:
--

bq. The typical solution is to do something like adjust expected:
Thanks Robert for the suggestion and explanation. I used the typical solution, 
and it's working now.

bq. I have not read the paper, but these are things to deal with when 
integrating into lucene.
For your information, if you want to take a look, the Terrier 4.0 source tree 
has this model in DFIC.java.

bq.  index-time boosts work on the norm, by making the document appear shorter 
or longer, so docLen might have a "crazy" value if the user does this.
I was relying on {{o.a.l.search.similarities.SimilarityBase}} for this, but it 
looks like all of its subclasses (DFR, IB) have this problem. I included a 
{{TestSimilarityBase#testNorms}} method in the new patch to demonstrate the 
problem. If I am not missing something obvious, this is a bug, no?




[jira] [Commented] (LUCENE-6818) Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr

2015-09-27 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14910012#comment-14910012
 ] 

Robert Muir commented on LUCENE-6818:
-

It happens when expected = 0, caused by the craziness of how spans score (they 
will happily score a term that does not exist). In this case totalTermFreq is 
zero, which makes expected go to zero, and then later the formula produces 
infinity (which the test checks for).

The test has this explanation for how spans score terms that don't exist:

{code}
// The problem: "normal" lucene queries create scorers, returning null if terms dont exist
// This means they never score a term that does not exist.
// however with spans, there is only one scorer for the whole hierarchy:
// inner queries are not real queries, their boosts are ignored, etc.
{code}

The typical solution is to do something like adjust expected:

{code}
final float expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
{code}
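
As a rough illustration of why the adjustment helps, the sketch below compares an unsmoothed and a smoothed `expected` for a term whose totalTermFreq is zero. It is self-contained and does not use Lucene classes. The chi-squared style measure and the `log2(measure + 1)` scoring step are assumptions made for the example, not a copy of the patch, and all method names are hypothetical.

```java
// Illustrative sketch only: not the code from the LUCENE-6818 patch.
class DfiExpectedSketch {

    static float log2(double x) {
        return (float) (Math.log(x) / Math.log(2.0));
    }

    // Assumed chi-squared style divergence between observed and expected
    // term frequency. Dividing by expected is what blows up at zero.
    static float measure(float freq, float expected) {
        return (freq - expected) * (freq - expected) / expected;
    }

    // Unsmoothed: expected is exactly 0 when the term does not exist.
    static float unsmoothedScore(long totalTermFreq, long numberOfFieldTokens,
                                 float freq, float docLen) {
        final float expected = totalTermFreq * docLen / numberOfFieldTokens;
        if (freq <= expected) {
            return 0f;
        }
        return log2(measure(freq, expected) + 1);
    }

    // Smoothed: adding 1 to both counts keeps expected strictly positive
    // for any positive docLen, so the division stays finite.
    static float smoothedScore(long totalTermFreq, long numberOfFieldTokens,
                               float freq, float docLen) {
        final float expected =
            (1 + totalTermFreq) * docLen / (1 + numberOfFieldTokens);
        if (freq <= expected) {
            return 0f;
        }
        return log2(measure(freq, expected) + 1);
    }

    public static void main(String[] args) {
        // A "term that does not exist": totalTermFreq == 0, yet it is scored
        // with a positive freq, as can happen with spans.
        System.out.println("unsmoothed: " + unsmoothedScore(0L, 1000L, 1f, 10f));
        System.out.println("smoothed:   " + smoothedScore(0L, 1000L, 1f, 10f));
    }
}
```

With these inputs the unsmoothed variant divides by zero and returns positive infinity, while the smoothed variant returns a finite positive score.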

I have not read the paper, but these are things to deal with when integrating 
into lucene. Another thing to be careful about is ensuring that integration of 
lucene's boosting is really safe: index-time boosts work on the norm, by making 
the document appear shorter or longer, so docLen might have a "crazy" value if 
the user does this.

