Re: When to use or not use stem field

Alessandro Benedetti Tue, 27 Apr 2021 05:07:44 -0700

Elaborating on top of the already good answers:
"Out of the box, the scoring will already take care of it."
Are we sure? I mean, it will "mostly" take care of it.

When using multi-field search, you can approach scoring in different ways.
For example, with edismax the tie parameter lets you move from a pure
disjunction-max query (tie=0, only the best clause counts) to a pure boolean
query (tie=1, the clause scores are summed), or anything in between.
A query "term1" on the fields qf=text text_stemmed produces:

Query Term = term1
Stemmed Query Term = term

*Pure Disjunction*
text:term1 | text_stemmed:term
The score is the max scoring clause.
For a document that contains the exact term "term1" the winning clause
could be any of the two.

*term1* in the field text has term frequency TF1 and document frequency DF1
*term* in the field text_stemmed has term frequency TF and document
frequency DF
TF >= TF1 (= if only term1 was originally present in the field, > if term1,
term2, term were present and stemmed to 'term')
DF >= DF1, hence IDF <= IDF1 (= if only term1 was originally present in the
field in the corpus, < if term1, term2, term were present and stemmed to
'term')
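
To make the two inequalities concrete, here is a minimal Python sketch that
counts TF and DF on a toy corpus. The corpus and the toy_stem function (it
just strips trailing digits, so term1 becomes term) are made up for
illustration and are not a real Solr analyzer.

```python
import re

def toy_stem(token):
    # Hypothetical stemmer: strips trailing digits, so "term1" -> "term".
    return re.sub(r"\d+$", "", token)

def tf_and_df(corpus, term, analyze):
    # Per-document term frequencies, and the number of documents that match.
    tfs = [sum(1 for tok in doc.split() if analyze(tok) == term) for doc in corpus]
    df = sum(1 for tf in tfs if tf > 0)
    return tfs, df

# Made-up corpus, one string per document.
corpus = ["term1 bla", "term2 term bla", "term1 term2 term1", "bla bla"]

# Un-stemmed field: tokens are compared as they are.
tf1s, df1 = tf_and_df(corpus, "term1", lambda tok: tok)
# Stemmed field: tokens go through the toy stemmer first.
tfs, df = tf_and_df(corpus, "term", toy_stem)

print("text:        ", tf1s, "DF1 =", df1)
print("text_stemmed:", tfs, "DF  =", df)
```

In every document the stemmed TF is at least the un-stemmed TF, and the
stemmed DF is at least DF1, so the stemmed IDF can only be equal or lower.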

Documents containing different terms may have matches with higher or lower
TF, while the DF of the stemmed term is always >= DF1.
BM25 saturates the impact of term frequency on the score; still, the
winning clause may come from text_stemmed:term because of term frequency.
So I think we can say that the exact match is likely to win because of the
Inverse Document Frequency factor, but it's not guaranteed in a pure
disjunction.

e.g.
*Doc1*
text: "*term1* bla bla bla bla"
TF(stemmed)= 1
TF1(un-stemmed)=1
DF1=100
DF=101

*Doc2*:
text:"*term2* *term3* *term4* *term5* bla bla *term6* bla bla"
TF(stemmed)= 5
DF= 101
TF1(un-stemmed)=0 - no match
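
A rough way to check this example is to plug the numbers into a simplified
BM25. The sketch below is only an approximation of what Lucene computes: the
corpus size (1000 documents), the average field length (7 tokens) and the
default-looking k1=1.2, b=0.75 are my assumptions, and the same lengths are
used for both fields to keep it short.

```python
import math

def bm25(tf, df, doc_len, n_docs=1000, avg_len=7.0, k1=1.2, b=0.75):
    # Simplified BM25; n_docs and avg_len are assumed, not from a real index.
    if tf == 0:
        return 0.0
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf / (tf + norm)

# Doc1: "term1 bla bla bla bla" (5 tokens), matches both fields.
doc1_exact = bm25(tf=1, df=100, doc_len=5)
doc1_stemmed = bm25(tf=1, df=101, doc_len=5)

# Doc2: 9 tokens, no exact match, 5 matches on the stemmed field.
doc2_exact = bm25(tf=0, df=100, doc_len=9)
doc2_stemmed = bm25(tf=5, df=101, doc_len=9)

# Pure disjunction: the score is the best clause.
doc1_score = max(doc1_exact, doc1_stemmed)
doc2_score = max(doc2_exact, doc2_stemmed)

print("Doc1:", round(doc1_score, 2), "Doc2:", round(doc2_score, 2))
```

Under these assumptions Doc1 scores roughly 1.18 and Doc2 roughly 1.77, so
the document without the exact term ranks first even though the exact clause
has the slightly higher IDF.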

*Pure Boolean*
text:term1 | text_stemmed:term
The score is the sum of the scoring clauses.
But the observation is similar:
Depending on the term frequency, we are likely to see a better score
for documents matching the exact term in the field 'text' (because the
exact term in the field 'text' has a higher inverse document frequency,
and the score of the stemmed counterpart is added on top).
But not always, because the inverse document frequency may not compensate
enough.
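
The two extremes can be written as one formula: the best clause plus tie
times the sum of the other clauses. Here is a small Python sketch of that
combination; the clause scores are illustrative values I picked, not numbers
from a real index.

```python
def dismax_score(clause_scores, tie):
    # Best clause, plus tie times the sum of the remaining clauses.
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)

# Illustrative per-clause scores: [text:term1, text_stemmed:term]
exact_doc = [1.18, 1.18]         # matches on both fields
stemmed_only_doc = [0.0, 1.77]   # matches on the stemmed field only

for tie in (0.0, 0.3, 0.7, 1.0):
    print(tie,
          round(dismax_score(exact_doc, tie), 2),
          round(dismax_score(stemmed_only_doc, tie), 2))
```

With these values the stemmed-only document wins for a low tie and the exact
match wins only once tie goes above 0.5, so the outcome depends on both the
clause scores and the tie you choose.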

I know many other factors affect the score, but without boosting to a
certain extent (what extent is not easy to say), I don't think we can
guarantee the un-stemmed match wins.
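
As a starting point, a boost on the un-stemmed field can be passed in qf.
The Python sketch below only builds the request; the host, the collection
name and the boost value of 2 are placeholders I made up, and the right boost
has to be found by testing on your own data.

```python
from urllib.parse import urlencode

# edismax request with a boost on the un-stemmed field.
params = {
    "defType": "edismax",
    "q": "term1",
    "qf": "text^2 text_stemmed",  # boost of 2 is an arbitrary example value
    "tie": "0.0",                 # pure disjunction; raise towards 1.0 to sum
}
query_string = urlencode(params)

# Hypothetical host and collection name.
url = "http://localhost:8983/solr/mycollection/select?" + query_string
print(url)
```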

Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io

On Fri, 23 Apr 2021 at 12:35, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hello,
>
> I would use both at the same time. You do not always want to find all
> stemmed forms of a term, but the unstemmed form instead, or at least have
> the latter being scored higher. Out of the box, the scoring will already
> take care of it.
>
> Although I actually prefer both in one field, using the KeywordRepeat
> filter. But that leads to other headaches that require even more work to
> fix. Use both fields and keep it simple.
>
> Regards,
> Markus
>
> On Fri, 23 Apr 2021 at 11:50, The Maverick <maveric...@posteo.de> wrote:
>
> > Hello
> >
> > I have a schema with two fields.
> > One is stemmed and one isn't.
> > When should I use the stemmed field in my search (or when shouldn't
> > I)?
> >
> > Regards
> > S
> >
>
