On 10/1/22 08:34, Christopher Schultz wrote:
> I have a multi-valued field of type text_general and a specific
> document contains one field value with text "foo:bar". When searching
> for either "foo" or "bar", I do not get this document in search results.
> However, when searching for "foo:bar" or "foo*" or "*bar" I do get the
> document, so it's definitely there and the field value is being searched.
> Is a colon (:) not a word-breaking token?
How exactly are you analyzing this field at index time and at query
time? For indexed values, a colon is treated as a word break by most
tokenizers and filters that split text into terms.
I don't know if this applies to your specific question, but it is good
to know: At query time, a colon is a special character to the lucene
and edismax query parsers, and possibly other parsers. The value
"foo:bar" means "search the field named foo for the term bar", so the
colon is consumed by the parser before the query text ever reaches the
analysis chain. If you intend a colon to actually be part of what
you're searching for, either enclose the value in quotes or escape it
with a backslash. But be aware that because the colon is often a word
break at index time, the colon may not be in the index anyway.
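For example, with a hypothetical field named description, either of
these forms sends a literal colon through query analysis (whether it
then matches depends on how the field was analyzed at index time):

  description:foo\:bar
  description:"foo:bar"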
One solution to problems with special characters is to use the
whitespace tokenizer, then use the word delimiter filter with
preserveOriginal, so the original term as well as the split terms are in
the index.
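A minimal sketch of such a field type, assuming placeholder names and a
current Solr version where the word delimiter filter is
WordDelimiterGraphFilterFactory (you would tune its flags to your data):

  <fieldType name="text_split_preserve" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- preserveOriginal="1" keeps "foo:bar" in addition to "foo" and "bar" -->
      <filter class="solr.WordDelimiterGraphFilterFactory"
              generateWordParts="1" preserveOriginal="1"/>
      <!-- required after a graph filter at index time -->
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory"
              generateWordParts="1" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>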
In SolrJ, the ClientUtils class contains a method named escapeQueryChars
that will escape all the characters that are special to commonly used
query parsers.
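A small usage sketch (the field name and input string here are only
illustrations):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.util.ClientUtils;

  public class EscapeExample {
      public static void main(String[] args) {
          // Hypothetical user input containing a colon
          String userInput = "foo:bar";

          // Backslash-escapes every character that is special to the
          // commonly used query parsers, including the colon: foo\:bar
          String escaped = ClientUtils.escapeQueryChars(userInput);

          // "description" is just a placeholder field name
          SolrQuery query = new SolrQuery("description:" + escaped);
          System.out.println(query.getQuery());
      }
  }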
> I have another field containing email addresses and if I search for e.g.
> "gmail.com" (without quotes), I'll get everyone whose email addresses
> end with "gmail.com".
> Hmm. I just checked, and if I search for "gmail" (without .com) I
> don't find them. Maybe without whitespace, those characters (:, .) do
> not cause a word-split?
If you are using the StandardTokenizer, one of the things it does is
preserve anything that looks like it might be an email address, so it
is not ripped apart into many small terms. That sounds like it is
probably causing this problem. If you actually do want an email address
ripped apart into tiny terms, you may want to craft a different analysis
chain. Getting just the right analysis for your use case is often one
of the most time-consuming parts of setting up a Solr index. Sometimes
you need to use copyField in your schema to have the same input analyzed
in more than one way.
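As a sketch of that last point (field and type names here are
hypothetical, reusing the text_split_preserve type from the earlier
example), a copyField can feed one source field into a second,
differently analyzed field:

  <field name="email" type="string" indexed="true" stored="true"/>
  <field name="email_parts" type="text_split_preserve" indexed="true" stored="false"/>
  <copyField source="email" dest="email_parts"/>

Queries for a whole address can hit email, while queries for pieces
like "gmail" can hit email_parts.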
Thanks,
Shawn