On 10/1/22 08:34, Christopher Schultz wrote:
> I have a multi-valued field of type text_general, and a specific document contains one field value with the text "foo:bar". When searching for either "foo" or "bar", I do not get this document in the search results.
>
> However, when searching for "foo:bar", "foo*", or "*bar", I do get the document, so it's definitely there and the field value is being searched.
>
> Is a colon (:) not a word-breaking character?

How exactly are you analyzing this field at index time and at query time?  For indexed values, a colon is treated as a word break by most tokenizers and filters that split text into words.

I don't know if this applies to your specific question, but it is good to know:  At query time, a colon is a special character to the Lucene and edismax query parsers, and possibly to other parsers as well. The value "foo:bar" means "search the field named foo for the term bar", so the colon will be gone before the query text even reaches the analysis chain.  If you intend the colon to actually be part of what you're searching for, either enclose the value in quotes or escape it with a backslash (foo\:bar).  But be aware that because the colon is often a word break at index time, the colon may not be in the index at all.

One solution to problems with special characters is to use the whitespace tokenizer followed by the word delimiter filter with preserveOriginal, so that both the original term and the split terms end up in the index.
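In schema terms, that might look something like this -- a rough, untested sketch, with a made-up fieldType name and only the options relevant here; you would want to adjust the filters for your data:

<fieldType name="text_split_keep" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- split only on whitespace, then let the word delimiter filter
         break on punctuation while keeping the original term -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1"/>
    <!-- graph filters should be flattened on the index side -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With a chain like that, a value such as "foo:bar" ends up in the index as "foo:bar", "foo", and "bar", so a plain search for foo or bar can match it.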

In SolrJ, the ClientUtils class has a method named escapeQueryChars that will escape all of the characters that are special to the commonly used query parsers.

> I have another field containing email addresses, and if I search for e.g. "gmail.com" (without quotes), I'll get everyone whose email address ends with "gmail.com".
>
> Hmm. I just checked, and if I search for "gmail" (without .com) I don't find them. Maybe without whitespace, those characters (:, .) do not cause a word split?

If you are using the StandardTokenizer, one of the things it does is preserve anything that looks like it might be an email address, so it is not ripped apart into many small terms.  That sounds like it is probably what is causing this.  If you actually do want an email address broken into smaller terms, you will need to craft a different analysis chain.

Getting the analysis just right for your use case is often one of the most time-consuming parts of setting up a Solr index.  Sometimes you need to use copyField in your schema so that the same input is analyzed in more than one way.
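As a sketch only (the field names are made up, and "text_split_keep" refers to the hypothetical splitting fieldType sketched earlier), you could keep one copy of the address whole and index a second copy broken into pieces:

<!-- whole address, for matching the full value -->
<field name="email" type="text_general" indexed="true" stored="true"/>
<!-- second copy analyzed so that "gmail" or "com" can match on their own -->
<field name="email_parts" type="text_split_keep" indexed="true" stored="false"/>
<copyField source="email" dest="email_parts"/>

A query could then search both fields, for example with the edismax parser and qf=email email_parts.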

Thanks,
Shawn
