[ https://issues.apache.org/jira/browse/LUCENE-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-8031:
--------------------------------
    Attachment: LUCENE-8031.patch

Here is a patch. I haven't improved the tests yet, and it doesn't address downgrading at all.

I ran omitTF experiments: mean average precision on 3 test collections, different languages, with/without stopwords, with different scoring systems. The diff column is DOCS (patch) relative to DOCS (master).

english:

EnglishAnalyzer(CharArraySet.EMPTY_SET)
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.3363|0.1465|0.2080|+42.0%|
|BM25|0.4492|0.2023|0.2746|+35.7%|
|I(ne)B2|0.4553|0.2151|0.2801|+30.2%|
|I(ne)B1|0.4231|0.1679|0.2539|+51.2%|
|PL2|0.3624|0.2006|0.2656|+32.4%|
|LM(dirichlet)|0.4408|0.2814|0.2851|+1.3%|
|DFI(chisquare)|0.4236|0.2493|0.2819|+13.1%|

EnglishAnalyzer()
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.3478|0.1651|0.2052|+24.3%|
|BM25|0.4505|0.2269|0.2720|+19.9%|
|I(ne)B2|0.4563|0.2401|0.2785|+16.0%|
|I(ne)B1|0.4285|0.1992|0.2516|+26.3%|
|PL2|0.4438|0.2182|0.2617|+19.9%|
|LM(dirichlet)|0.4372|0.2827|0.2851|+0.8%|
|DFI(chisquare)|0.4380|0.2637|0.2858|+8.4%|

bengali:

BengaliAnalyzer(CharArraySet.EMPTY_SET)
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.2326|0.1211|0.1371|+13.2%|
|BM25|0.2989|0.1367|0.1673|+22.4%|
|I(ne)B2|0.3111|0.1469|0.1738|+18.3%|
|I(ne)B1|0.2886|0.1237|0.1520|+22.9%|
|PL2|0.2906|0.1372|0.1636|+19.2%|
|LM(dirichlet)|0.3007|0.1805|0.1829|+1.3%|
|DFI(chisquare)|0.2938|0.1678|0.1790|+6.7%|

BengaliAnalyzer()
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.2266|0.1231|0.1360|+10.5%|
|BM25|0.2947|0.1390|0.1649|+18.6%|
|I(ne)B2|0.3074|0.1485|0.1723|+16.0%|
|I(ne)B1|0.2848|0.1248|0.1486|+19.1%|
|PL2|0.2856|0.1377|0.1608|+16.8%|
|LM(dirichlet)|0.2982|0.1803|0.1836|+1.8%|
|DFI(chisquare)|0.2887|0.1703|0.1810|+6.3%|

kurdish:

SoraniAnalyzer(CharArraySet.EMPTY_SET)
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.2957|0.1625|0.1811|+11.4%|
|BM25|0.3207|0.1871|0.2087|+11.5%|
|I(ne)B2|0.3354|0.1937|0.2113|+9.1%|
|I(ne)B1|0.3263|0.1762|0.1992|+13.1%|
|PL2|0.3134|0.1738|0.2002|+15.2%|
|LM(dirichlet)|0.2877|0.2130|0.2149|+0.9%|
|DFI(chisquare)|0.3157|0.2014|0.2129|+5.7%|

SoraniAnalyzer()
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.2977|0.1654|0.1781|+7.7%|
|BM25|0.3205|0.1918|0.2077|+8.3%|
|I(ne)B2|0.3345|0.1979|0.2107|+6.5%|
|I(ne)B1|0.3266|0.1798|0.1970|+9.6%|
|PL2|0.3115|0.1761|0.1998|+13.5%|
|LM(dirichlet)|0.2815|0.2116|0.2144|+1.3%|
|DFI(chisquare)|0.3143|0.2022|0.2115|+4.6%|

> DOCS_ONLY fields set incorrect length norms
> -------------------------------------------
>
>                 Key: LUCENE-8031
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8031
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-8031.patch
>
>
> Term frequencies are discarded in the DOCS_ONLY case from the postings list
> but they still count against the length normalization, which looks like it
> may screw stuff up.
> I ran some quick experiments on LUCENE-8025, by encoding
> fieldInvertState.getUniqueTermCount() and it seemed worth fixing (e.g. 20% or
> 30% improvement potentially). Happy to do testing for real, if we want to fix.
> But this seems tricky: today you can downgrade to DOCS_ONLY on the fly, and
> it's hard for me to think about that case (I think it's generally screwed up
> besides this, but still).
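
To make the length-norm mismatch concrete, here is a minimal sketch in plain Java. It is not Lucene code and not the attached patch: the class `DocsOnlyLengthSketch` and its methods are hypothetical names made up for illustration, and `uniqueTermCount` only stands in for what `fieldInvertState.getUniqueTermCount()` reports. The idea shown is that a field indexed without frequencies would be normalized by its number of distinct terms rather than by its total token count, so repeats that the postings list no longer records stop inflating the length.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Hypothetical illustration only; none of these names come from Lucene. */
class DocsOnlyLengthSketch {

    /** Total number of tokens in the field: every repeat counts. */
    static int totalTokenCount(List<String> tokens) {
        return tokens.size();
    }

    /** Number of distinct terms (stand-in for getUniqueTermCount()). */
    static int uniqueTermCount(List<String> tokens) {
        return new HashSet<>(tokens).size();
    }

    /**
     * Length to feed into length normalization.
     * With frequencies indexed, use the total token count.
     * Without frequencies (DOCS only), repeats are not visible to scoring,
     * so assume the distinct term count is the consistent choice.
     */
    static int lengthForNorm(List<String> tokens, boolean freqsIndexed) {
        return freqsIndexed ? totalTokenCount(tokens) : uniqueTermCount(tokens);
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println(lengthForNorm(doc, true));  // prints 6
        System.out.println(lengthForNorm(doc, false)); // prints 4
    }
}
```

With frequencies dropped, every matching term contributes as if it occurred once, yet on master the six-token document above would still be penalized as length 6. Using 4 keeps the length on the same scale as the frequencies the scorer actually sees. How this should interact with downgrading to DOCS_ONLY on the fly is not covered by this sketch.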