Re: solr relatedness weirdness on json facet function

Michael Gibney Wed, 06 Apr 2022 09:14:24 -0700

I think the behavior you're seeing is a consequence of the fact that you're
applying index-time stopword filtering *before* the tokens are further
manipulated by WordDelimiterGraphFilter. E.g.:


"the token-is retained" => "the" "token-is" "retained" => "the" "token"
"is" "retained"

In the case above, "token-is" doesn't match the stopword list, but it is
subsequently decomposed into "token" and "is". So in fact you're likely
getting other unexpected query behavior; it's just easier to see/more
explicit with faceting.
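
To make the ordering effect concrete, here is a minimal toy sketch in Python. It is not Solr code: the stopword set, the helper names, and the split rule are assumptions that only roughly imitate StopFilter and the generateWordParts behavior of WordDelimiterGraphFilter.

```python
import re

# Assumed sample stopwords; the real stopwords.txt is not shown in this thread.
STOPWORDS = {"the", "is", "a", "be"}

def whitespace_tokenize(text):
    # Stand-in for WhitespaceTokenizer: split on whitespace only.
    return text.split()

def stop_filter(tokens):
    # Stand-in for StopFilter with ignoreCase="true".
    return [t for t in tokens if t.lower() not in STOPWORDS]

def word_delimiter(tokens):
    # Rough stand-in for generateWordParts="1": split on non-alphanumerics.
    parts = []
    for token in tokens:
        parts.extend(p for p in re.split(r"[^A-Za-z0-9]+", token) if p)
    return parts

text = "the token-is retained"

# Order used in the posted fieldType: stop filter first, then word splitting.
stop_then_split = word_delimiter(stop_filter(whitespace_tokenize(text)))

# Alternative order: word splitting first, then stop filter.
split_then_stop = stop_filter(word_delimiter(whitespace_tokenize(text)))

print(stop_then_split)  # ['token', 'is', 'retained']
print(split_then_stop)  # ['token', 'retained']
```

In this toy model the stopword "is" survives only when the stop filter runs before the split, which suggests that moving StopFilterFactory after WordDelimiterGraphFilterFactory (and reindexing) may be worth testing.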

It may also be worth noting that faceting on tokenized fields (TextField)
currently only works via "uninversion" of indexed field values (i.e.,
docValues are not supported). This can be quite resource-intensive, and is
probably best avoided unless you have a specific need for it (which
you may well have!).

Also, index-time WordDelimiterGraphFilter configured to both "split" and
"catenate" tokens can yield subtly strange results in phrase queries, if
that matters to you.

Michael

On Wed, Apr 6, 2022 at 4:52 AM Dan Rosher <rosh...@gmail.com> wrote:

> Hi Michael,
>
> Here are the field and fieldType with a result snippet.
>
> I've checked the stopword list, and words like "a" or "be" are in it. I've
> also used the UI analysis to check that they indeed should be removed when
> indexed and queried.
>
> Many thanks,
> Dan
>
> *example results:*
> ....
>   "facets": {
>     "count": 58215,
>     "description": {
>       "buckets": [
>         {
>           "val": "a",
>           "count": 4,
>           "relatedness": {
>             "relatedness": 0.98239,
>             "foreground_popularity": 0.01279,
>             "background_popularity": 0.01279
>           }
>         },
>         {
>           "val": "be",
>           "count": 6,
>           "relatedness": {
>             "relatedness": 0.98239,
>             "foreground_popularity": 0.01279,
>             "background_popularity": 0.01279
>           }
>         },
> ....
>
> *field*:        <field name="description"   type="textgen-stemmed"
> indexed="true"  stored="true"  multiValued="false"/>
> *fieldtype*:
>        <fieldType name="textgen-stemmed" class="solr.TextField"
> positionIncrementGap="100">
>             <similarity class="solr.ClassicSimilarityFactory"/>
>             <analyzer type="index">
>                 <charFilter class="solr.PatternReplaceCharFilterFactory"
> pattern="\.$" replacement=""/>
>                 <charFilter class="solr.PatternReplaceCharFilterFactory"
> pattern="\.\s+" replacement=" "/>
>                 <charFilter class="solr.PatternReplaceCharFilterFactory"
> pattern="[*,;|/]" replacement=" "/>
>                 <charFilter class="solr.PatternReplaceCharFilterFactory"
> pattern="(\S+)(\.(?i:net))\b" replacement="$1 $2"/>
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                 <filter class="solr.SynonymGraphFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>                        <!-- STOPWORDS HERE -->
>                 <filter class="solr.WordDelimiterGraphFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"
> splitOnNumerics="0"/>
>                 <filter class="solr.FlattenGraphFilterFactory"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt" />
>                 <filter class="solr.KStemFilterFactory"/>
>             </analyzer>
>             <analyzer type="query">
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                 <filter class="solr.SynonymGraphFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>                      <!-- STOPWORDS HERE -->
>                 <filter class="solr.WordDelimiterGraphFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
> splitOnNumerics="0"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt" />
>                 <filter class="solr.KStemFilterFactory"/>
>             </analyzer>
>         </fieldType>
>
>
> On Tue, 5 Apr 2022 at 14:58, Michael Gibney <mich...@michaelgibney.net>
> wrote:
>
> > Both `qf` and `relatedness` should be orthogonal to your question, iiuc.
> > Understanding that your question is mainly about which terms are included
> > (i.e. included at all -- never mind ranking), then the only thing that
> > should determine that is the field and fieldType config for the terms
> > facet "field" property -- i.e., "description". Can you share that
> > information, including index-time analysis chain config?
> >
> > On Tue, Apr 5, 2022 at 8:52 AM Dan Rosher <rosh...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > If I run a facet on relatedness on a qf field (examples below) which has
> > > stopword removal, I get stopwords in the json facet?
> > >
> > > Anyone know why, and if this can be avoided?
> > >
> > > Many thanks,
> > > Dan
> > >
> > > =================
> > >
> > > Details
> > > Solr 7.7.2
> > >
> > > http://localhost:8983/solr/collection/select?
> > > q=my query&
> > > defType=edismax&
> > > qf=description&
> > > fore={!type=$defType qf=$qf v=$q}&
> > > back=*:*&
> > > rows=0&
> > > json.facet={
> > >   "description":{
> > >     "type": "terms",
> > >     "field": "description",
> > >     "sort": { "relatedness": "desc"},
> > >     "mincount": 2,
> > >     "limit": 8,
> > >     "facet": {
> > >         "relatedness": {
> > >             "type": "func",
> > >             "func": "relatedness($fore,$back)"
> > >         }
> > >     }
> > >   }
> > > }
> > >
> >
>
