Re: Solr limit in words search - take 2

Michael Gibney Wed, 17 Nov 2021 10:14:57 -0800

Right, sorry I forgot to mention the absence of FlattenGraphFilter. Tbh I'm
not 100% clear on what cases it helps out with; but at the end of the day
it has no effect on underlying issues having to do with the fact that if
your index-time analysis chain produces "graph" tokenstreams, the Lucene
`[Default]IndexingChain` completely disregards the PositionLengthAttribute,
which is necessary to properly reconstruct the indexed graph at query time.


It's possible FlattenGraphFilter might help your case -- in fact if you do
nothing else I'd certainly suggest that you use it. But I'm certain that
there are some classes of problems that are fundamentally related to
LUCENE-4312, and FlattenGraphFilter can't fix them. I'll be curious to know
whether the addition of FlattenGraphFilter helps in your case, though!
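
For reference, a minimal sketch of where the filter would go, assuming the
index-time analyzer from the fieldType quoted below. Placing
FlattenGraphFilterFactory directly after WordDelimiterGraphFilterFactory is
my reading of the ref guide passage you quoted; I haven't tested this
against your data, and the rest of the chain is copied unchanged:

```xml
<!-- Sketch only: index-time analyzer with FlattenGraphFilterFactory added.
     Everything except the FlattenGraphFilterFactory line is copied from the
     original fieldType; the placement is an untested assumption. -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
          catenateWords="1" catenateNumbers="1" preserveOriginal="1"
          splitOnNumerics="0"/>
  <!-- Squashes the graph into a flat token stream before indexing -->
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="45"/>
</analyzer>
```

A full reindex would be needed after any index-time analysis change.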

Michael

On Wed, Nov 17, 2021 at 12:57 PM Scott <[email protected]> wrote:

> Could this be related ?
>
>
> https://solr.apache.org/guide/6_6/filter-descriptions.html#FilterDescriptions-WordDelimiterGraphFilter
>
> "If you use this filter during indexing, you must follow it with a Flatten
> Graph Filter to squash tokens on top of one another like the Word Delimiter
> Filter, because the indexer can’t directly consume a graph. To get fully
> correct positional queries when tokens are split, you should instead use
> this filter at query time."
>
>
>
> -----Original Message-----
> From: Michael Gibney <[email protected]>
> Sent: Wednesday, November 17, 2021 12:07 PM
> To: [email protected]
> Subject: Re: Solr limit in words search - take 2
>
> This is not the most thorough answer, but hopefully gets you headed in the
> right direction:
>
> Very strange things can happen when your index-time analysis chain
> generates "graph" token-streams (as yours does). A couple of things you
> could try:
> 1. experiment with setting `enableGraphQueries=false` on the fieldtype
> 2. upgrading to solr >=8.1 may address your issue partially, via
>    LUCENE-8730 -- here I go out on a limb in guessing that you're not
>    _already_ on 8.1+ :-)
> 3. increase the phrase slop param, to be more lenient in matching
>    "phrases". (as I say this I'm not sure it would actually help your
>    case, because you're dealing with explicit phrases, and iirc phrase
>    slop may only configure _implicit_ ("pf") phrase searches?)
>
> The _best_ approach would be to configure your index-time analysis
> chain(s) so that they don't have multi-term "expand" synonyms, and WDGF
> either only splits ("generate*Parts", etc.) or only catenates ("catenate*",
> "preserveOriginal"). One approach that can work is to index into two
> fields, each with a dedicated index-time analysis type (split or catenate).
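>
> A rough sketch of the two-field idea, untested. The names
> `partial_text_split`, `partial_text_catenate`, `subject_split` and
> `subject_cat` are made up for illustration, and the filters after
> LowerCaseFilterFactory from your original chain are left out for brevity:
>
> ```xml
> <!-- Hypothetical field type: WDGF only splits, never catenates -->
> <fieldType name="partial_text_split" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateWordParts="1" generateNumberParts="1"
>             catenateWords="0" catenateNumbers="0" preserveOriginal="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <!-- Hypothetical field type: WDGF only catenates / preserves original -->
> <fieldType name="partial_text_catenate" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateWordParts="0" generateNumberParts="0"
>             catenateWords="1" catenateNumbers="1" preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <!-- Hypothetical fields, both fed from the same source field -->
> <field name="subject_split" type="partial_text_split" multiValued="true"/>
> <field name="subject_cat" type="partial_text_catenate" multiValued="true"/>
> <copyField source="subject" dest="subject_split"/>
> <copyField source="subject" dest="subject_cat"/>
> ```
>
> Queries would then need to target both fields (for example via edismax
> `qf`), and each type would still need a matching query-time analyzer.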
>
> Some relevant issues:
> https://issues.apache.org/jira/browse/LUCENE-7398
> https://issues.apache.org/jira/browse/LUCENE-4312
>
> Michael
>
> On Wed, Nov 17, 2021 at 11:18 AM Scott <[email protected]> wrote:
>
> > My apologies for the previous e-mail…should have never sent that as
> > html
> >
> > I am facing a weird issue, possibly caused by my config.
> >
> > I have indexed a document which has a field called subject, subject is
> > defined as:
> >
> > <field name="subject" type="partial_text_general"/>
> >
> >   <fieldType name="partial_text_general" class="solr.TextField"
> > positionIncrementGap="100" multiValued="true">
> >         <analyzer type="index">
> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >                 <filter class="solr.WordDelimiterGraphFilterFactory"
> > generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
> > catenateWords="1" catenateNumbers="1" preserveOriginal="1"
> > splitOnNumerics="0"/>
> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >                 <filter class="solr.EnglishPossessiveFilterFactory"/>
> >                 <filter class="solr.KeywordMarkerFilterFactory"
> > protected="protwords.txt"/>
> >                 <filter class="solr.EnglishMinimalStemFilterFactory"/>
> >                 <filter class="solr.EdgeNGramFilterFactory"
> > minGramSize="2" maxGramSize="45" />
> >         </analyzer>
> >         <analyzer type="query">
> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >                 <filter class="solr.SynonymFilterFactory"
> > synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >                 <filter class="solr.WordDelimiterGraphFilterFactory"
> > generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
> > catenateWords="1" catenateNumbers="1" splitOnNumerics="0"/>
> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >                 <filter class="solr.EnglishPossessiveFilterFactory"/>
> >                 <filter class="solr.KeywordMarkerFilterFactory"
> > protected="protwords.txt"/>
> >                 <filter class="solr.EnglishMinimalStemFilterFactory"/>
> >         </analyzer>
> >   </fieldType>
> >
> > I have a document with subject field: <str>cobrancas E-mail marketing
> > em dezembro, 2020 - referente ao uso de novembro</str>
> >
> > If I search for <str name="q">subject:"cobrancas e-mail"</str> then it
> > finds the document, but if I search for <str
> > name="q">subject:"cobrancas e-mail marketing"</str> I have no match.
> >
> > Why would this happen ?
> >
> > Thank you!
> >
> >
> >
>
>
