Hello Miguel,

That's likely due to catenateAll/catenateWords. 'McNeil' is first split
into 'mc' and 'neil', so you can find it using 'mc neil', but not
'mcneil'. With the catenate* settings enabled, the split terms 'mc' and
'neil' can be joined back into 'mcneil'.
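
As a rough sketch only (assuming the text_phonetic field type from your
first mail, and not tested against your index), the WordDelimiterFilter
line could look like this in both the index and the query analyzer:

```xml
<!-- Sketch, not a verified fix: use the same filter settings inside both
     <analyzer type="index"> and <analyzer type="query">. -->
<!-- catenateWords/catenateAll emit the rejoined token in addition to the
     split parts, which generateWordParts still produces. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="1"
        splitOnCaseChange="1" splitOnNumerics="0"/>
```

Changing the index-time side only affects newly indexed documents, so
existing documents need a reindex before the extra tokens show up.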

If you haven't already, use Solr's analysis GUI [1] for testing these
configurations. It shows, step by step, what the index- and query-time
analysis chains produce, and whether they match up in the end.

Regards,
Markus

[1] http://localhost:8983/solr/#/COLLECTION/analysis

On Tue, 27 Sep 2022 at 16:54, Miguel Joy
<[email protected]> wrote:

> Hi Markus,
>
> Thanks so much for your recommendations. Matching the index-time and
> query-time splitOnCaseChange attributes partially fixed our issue.
> Now, if I search for [email protected] using exactly the same case as
> the stored email, I get a successful result! However, if I search
> using [email protected] (all lower-case), it doesn't match.
> Essentially, I only get results when I search using exactly the same
> case as the stored email. Any additional ideas on how I can get the
> email search to fully work? Thanks again for your help.
>
> -Miguel
>
>
>
> -----Original Message-----
> From: Miguel Joy
> Sent: Tuesday, September 27, 2022 6:43 AM
> To: [email protected]
> Subject: RE: Solr Search - Mixed Case Issue
>
> Hi Markus,
>
> Thanks for your prompt reply to my issue.  I will try your suggestions and
> report back.
>
> Thanks,
> -Miguel
>
> -----Original Message-----
> From: Markus Jelsma <[email protected]>
> Sent: Tuesday, September 27, 2022 6:36 AM
> To: [email protected]
> Subject: Re: Solr Search - Mixed Case Issue
>
> Hello Miguel,
>
> The problem lies with the different index-time and query-time
> WordDelimiterFilter configurations.
>
> > In addition, it's strange that we get search results on some mixed
> > case email addresses
>
> Yes, precisely!
>
> See the splitOnCaseChange attributes, that is where the problem is. In
> your case you should be able to copy the index-time configuration to the
> query-time analyzer and get rid of the problem without reindexing. It
> 'should' solve the problem. If not, try enabling catenateAll on both
> sides, but that requires a reindex.
>
> Ideally you should probably also get rid of the StopFilterFactory. Unless
> it is very well configured (which I do not suspect), it will cause
> additional weird problems. Removing it does require reindexing.
>
> Regards,
> Markus
>
> On Tue, 27 Sep 2022 at 11:55, Miguel Joy
> <[email protected]> wrote:
>
> > Hi all,
> >
> > I'm new to Solr and recently inherited a Solr application (version
> > 5.4) from a previous developer with very little documentation.  At any
> > rate, my problem is this:
> >
> > I have some email addresses that are stored as mixed case.
> >
> > [email protected] = Success [querying for this email address and
> > passing in the full email address in any case [upper or lower]
> > returns the correct result]
> >
> > [email protected] = Fail [querying for this email address and
> > passing in the full email address in any case [upper or lower]
> > returns zero results]
> >
> > And here's the fieldType definition that's used for email addresses:
> >
> > <fieldType name="text_phonetic" class="solr.TextField"
> >            positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >   <analyzer type="index">
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory"
> >             ignoreCase="true"
> >             words="stopwords.txt"/>
> >     <filter class="solr.WordDelimiterFilterFactory"
> >             generateWordParts="1" generateNumberParts="1"
> >             catenateWords="1" catenateNumbers="1" catenateAll="0"
> >             splitOnCaseChange="1" splitOnNumerics="0"/>
> >     <filter class="solr.PhoneticFilterFactory"
> >             encoder="Caverphone" inject="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"
> >             protected="protwords.txt"/>
> >     <filter class="solr.PorterStemFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.SynonymFilterFactory"
> >             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >     <filter class="solr.StopFilterFactory"
> >             ignoreCase="true"
> >             words="stopwords.txt"/>
> >     <filter class="solr.WordDelimiterFilterFactory"
> >             generateWordParts="1" generateNumberParts="1"
> >             catenateWords="0" catenateNumbers="0" catenateAll="0"
> >             splitOnCaseChange="0" splitOnNumerics="0"/>
> >     <filter class="solr.PhoneticFilterFactory"
> >             encoder="Caverphone" inject="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"
> >             protected="protwords.txt"/>
> >     <filter class="solr.PorterStemFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > I've spent a couple of days researching this issue, and my best guess
> > at a fix would be to re-index this data using the
> > LowerCaseFilterFactory so that all email addresses are stored in lower
> > case, but that would be a significant change as I have over 10 million
> > docs indexed.  In addition, it's strange that we get search results on
> > some mixed case email addresses, but not all, so I'm hoping that maybe
> > all we need is to tweak the query analyzer?  Thanks in advance for your
> > help with this question.  Please let me know if you need any additional
> > details.
> >
> > -Miguel
> >
> >
> >
> > ________________________________
> >
> > Notice: GBT Travel Services UK Limited (GBT UK) and its authorised
> > sublicensees (including Ovation Travel Group and Egencia) use certain
> > trademarks and service marks of American Express Company or its
> > subsidiaries (American Express) in the 'American Express Global
> > Business Travel' and 'American Express Meetings & Events' brands and
> > in connection with its business for permitted uses only under a
> > limited licence from American Express (Licensed Marks). The Licensed
> > Marks are trademarks or service marks of, and the property of,
> > American Express. GBT UK is a subsidiary of Global Business Travel
> > Group, Inc. (NYSE: GBTG). American Express holds a minority interest
> > in GBTG, which operates as a separate company from American Express.
> >
> > ________________________________
> >
> > This email message and all attachments transmitted with it are solely
> > for the use of the intended recipient(s) and may contain confidential
> > and/or privileged information. If the reader of this message is not
> > the intended recipient, you are hereby notified that any
> > dissemination, distribution, copying and/or other use of this message
> > or its attachments is strictly prohibited. If you have received this
> > message in error, please notify the sender and delete it immediately.
> > Unintended transmission shall not constitute a waiver of the
> > attorney-client or any other privilege.
> >
>
>
