[ https://issues.apache.org/jira/browse/LUCENE-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-8031:
--------------------------------
    Attachment: LUCENE-8031.patch

Here is a patch. I haven't improved the tests yet, and it doesn't address 
downgrading at all.

I ran omitTF experiments, measuring mean average precision on 3 test 
collections in different languages, with and without stopwords, and with 
different scoring systems. The diff column is the relative change of 
DOCS (patch) over DOCS (master).

english:

EnglishAnalyzer(CharArraySet.EMPTY_SET)
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.3363|0.1465|0.2080|+42.0%|
|BM25|0.4492|0.2023|0.2746|+35.7%|
|I(ne)B2|0.4553|0.2151|0.2801|+30.2%|
|I(ne)B1|0.4231|0.1679|0.2539|+51.2%|
|PL2|0.3624|0.2006|0.2656|+32.4%|
|LM(dirichlet)|0.4408|0.2814|0.2851|+1.3%|
|DFI(chisquare)|0.4236|0.2493|0.2819|+13.1%|

EnglishAnalyzer()
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.3478|0.1651|0.2052|+24.3%|
|BM25|0.4505|0.2269|0.2720|+19.9%|
|I(ne)B2|0.4563|0.2401|0.2785|+16.0%|
|I(ne)B1|0.4285|0.1992|0.2516|+26.3%|
|PL2|0.4438|0.2182|0.2617|+19.9%|
|LM(dirichlet)|0.4372|0.2827|0.2851|+0.8%|
|DFI(chisquare)|0.4380|0.2637|0.2858|+8.4%|

bengali:

BengaliAnalyzer(CharArraySet.EMPTY_SET)
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.2326|0.1211|0.1371|+13.2%|
|BM25|0.2989|0.1367|0.1673|+22.4%|
|I(ne)B2|0.3111|0.1469|0.1738|+18.3%|
|I(ne)B1|0.2886|0.1237|0.1520|+22.9%|
|PL2|0.2906|0.1372|0.1636|+19.2%|
|LM(dirichlet)|0.3007|0.1805|0.1829|+1.3%|
|DFI(chisquare)|0.2938|0.1678|0.1790|+6.7%|

BengaliAnalyzer()
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.2266|0.1231|0.1360|+10.5%|
|BM25|0.2947|0.1390|0.1649|+18.6%|
|I(ne)B2|0.3074|0.1485|0.1723|+16.0%|
|I(ne)B1|0.2848|0.1248|0.1486|+19.1%|
|PL2|0.2856|0.1377|0.1608|+16.8%|
|LM(dirichlet)|0.2982|0.1803|0.1836|+1.8%|
|DFI(chisquare)|0.2887|0.1703|0.1810|+6.3%|

kurdish:

SoraniAnalyzer(CharArraySet.EMPTY_SET)
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.2957|0.1625|0.1811|+11.4%|
|BM25|0.3207|0.1871|0.2087|+11.5%|
|I(ne)B2|0.3354|0.1937|0.2113|+9.1%|
|I(ne)B1|0.3263|0.1762|0.1992|+13.1%|
|PL2|0.3134|0.1738|0.2002|+15.2%|
|LM(dirichlet)|0.2877|0.2130|0.2149|+0.9%|
|DFI(chisquare)|0.3157|0.2014|0.2129|+5.7%|

SoraniAnalyzer()
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.2977|0.1654|0.1781|+7.7%|
|BM25|0.3205|0.1918|0.2077|+8.3%|
|I(ne)B2|0.3345|0.1979|0.2107|+6.5%|
|I(ne)B1|0.3266|0.1798|0.1970|+9.6%|
|PL2|0.3115|0.1761|0.1998|+13.5%|
|LM(dirichlet)|0.2815|0.2116|0.2144|+1.3%|
|DFI(chisquare)|0.3143|0.2022|0.2115|+4.6%|
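
For anyone re-checking the tables: the diff column appears to be the relative change of the DOCS (patch) score over the DOCS (master) score, which is what the rows work out to. A minimal sketch of that arithmetic follows. The class and method names are hypothetical and are not part of the patch.

```java
import java.util.Locale;

/** Standalone arithmetic check; class and method names are hypothetical. */
final class MapDiffSketch {

    /** Relative change of the patched score over the master score, in percent. */
    static double relativeChangePercent(double master, double patch) {
        return (patch - master) / master * 100.0;
    }

    /** Formats the change with an explicit sign and one decimal, like the diff column. */
    static String formatDiff(double master, double patch) {
        return String.format(Locale.ROOT, "%+.1f%%", relativeChangePercent(master, patch));
    }

    public static void main(String[] args) {
        // Classic, EnglishAnalyzer(CharArraySet.EMPTY_SET): master 0.1465, patch 0.2080.
        System.out.println(formatDiff(0.1465, 0.2080));
        // LM(dirichlet), same table: master 0.2814, patch 0.2851.
        System.out.println(formatDiff(0.2814, 0.2851));
    }
}
```

The DOCS_AND_FREQS column is not part of this calculation; it only shows how far either DOCS variant is from scoring with term frequencies.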


> DOCS_ONLY fields set incorrect length norms
> -------------------------------------------
>
>                 Key: LUCENE-8031
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8031
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-8031.patch
>
>
> Term frequencies are discarded from the postings list in the DOCS_ONLY case, 
> but they still count toward the length normalization, which looks like it 
> may skew scoring.
> I ran some quick experiments on LUCENE-8025 by encoding 
> fieldInvertState.getUniqueTermCount(), and it seemed worth fixing 
> (potentially a 20% or 30% improvement). Happy to do real testing if we want 
> to fix it.
> But this seems tricky: today you can downgrade to DOCS_ONLY on the fly, and 
> it's hard for me to reason about that case (I think it's generally broken 
> apart from this, but still).
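
To make the mismatch in the issue description concrete, here is a minimal standalone sketch. It assumes a simplified 1/sqrt(length) normalization and uses hypothetical class and method names; it does not use Lucene and is not the attached patch. It compares a length that counts every token occurrence with one that counts each distinct term once, which is closer to what a DOCS-only postings list still records.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Standalone illustration; does not use Lucene and is not the attached patch. */
final class DocsOnlyLengthSketch {

    /** Length when every token occurrence counts, i.e. term frequencies are included. */
    static int totalTokenCount(List<String> tokens) {
        return tokens.size();
    }

    /** Length when each distinct term counts once, as in a unique term count. */
    static int uniqueTermCount(List<String> tokens) {
        Set<String> unique = new HashSet<>(tokens);
        return unique.size();
    }

    /** Simplified 1/sqrt(length) normalization; an assumption made for illustration only. */
    static double lengthNorm(int length) {
        return length == 0 ? 0.0 : 1.0 / Math.sqrt(length);
    }

    public static void main(String[] args) {
        // A repetitive field: 9 token occurrences but only 3 distinct terms.
        List<String> tokens = Arrays.asList(
                "spam", "spam", "spam", "spam", "eggs", "ham", "spam", "spam", "spam");

        int total = totalTokenCount(tokens);
        int unique = uniqueTermCount(tokens);

        // With frequencies dropped, the repeats no longer help the score,
        // yet a length based on the total still penalizes the document for them.
        System.out.println("norm from total length:  " + lengthNorm(total));
        System.out.println("norm from unique length: " + lengthNorm(unique));
    }
}
```

In this toy field the total-based length is three times the unique-based one, so the document is normalized as if it were much longer than the information kept in the postings suggests.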


