Shawn,
On 10/1/22 11:50, Shawn Heisey wrote:
On 10/1/22 08:34, Christopher Schultz wrote:
I have a multi-valued field of type text_general and a specific
document contains one field value with text "foo:bar". When searching
for either "foo" or "bar", I do not get this document in search results.
However, when searching for "foo:bar" or "foo*" or "*bar" I do get the
document, so it's definitely there and the field value is being searched.
Is a colon (:) not a word-breaking token?
How exactly are you analyzing this field at index time and at query
time? For indexed values, a colon is a word break for most tokenizers
or filters that do word breaks.
I'm using SolrJ client and just specifying a document field like this:
document.addField("fieldName", "foo:bar");
The field "fieldName" doesn't have anything out of the ordinary
specified for it. That is, I haven't customized *anything* about the
indexing of this (or any other) field in my Solr core.
"fieldName" is copied to "all", and "all" is the default search field.
I don't know if this applies to your specific question, but it is good
to know: At query time, a colon is a special character to the lucene
and edismax query parsers, and possibly other parsers. The value
"foo:bar" means "search the field named foo for the term bar" which
means that the colon will be gone before the query text even hits the
analysis chain. If you intend a colon to actually be in what you're
searching for, either enclose the value in quotes or escape it with a
backslash. But be aware that because the colon is often a word break at
index time, the colon may not be in the index.
I can find the document by searching for "foo:bar" (with quotes, because
as you say 'foo' isn't a field name). But I cannot find the document by
searching for any of the following:
foo
bar
all:foo
all:bar
fieldName:foo
fieldName:bar
These queries will return the document I'm looking for:
*bar
all:*bar
fieldName:*bar
One solution to problems with special characters is to use the
whitespace tokenizer, then use the word delimiter filter with
preserveOriginal, so the original term as well as the split terms are in
the index.
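
To make that concrete, here is a toy sketch in plain Java of which terms such a chain would leave in the index. It is not Solr code: the class and method names are made up for illustration, and the real word delimiter filter does more (for example, splitting on case changes), so treat it only as a rough picture of "keep the original term and also the split parts".

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for "whitespace tokenizer + word delimiter filter with
// preserveOriginal". Hypothetical illustration only, not the Solr implementation.
class PreserveOriginalSketch {

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Step 1: split on whitespace only, so "foo:bar" survives as one token.
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Step 2: keep the original token (the preserveOriginal behaviour).
            terms.add(token);
            // Step 3: also emit the parts found between non-alphanumeric delimiters.
            for (String part : token.split("[^\\p{L}\\p{N}]+")) {
                if (!part.isEmpty() && !part.equals(token)) {
                    terms.add(part);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [foo:bar, foo, bar, baz]
        System.out.println(analyze("foo:bar baz"));
    }
}
```

With terms like these in the index, a search for "foo", "bar", or the quoted "foo:bar" would each have something to match.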
In SolrJ, the ClientUtils class contains a method named escapeQueryChars
that will escape all the characters that are special to commonly used
query parsers.
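
As a rough picture of what that escaping does, here is a simplified stand-in in plain Java. The class name and the exact list of special characters are assumptions made for this sketch; in real code, call ClientUtils.escapeQueryChars rather than copying this.

```java
// Simplified stand-in for query-character escaping. Hypothetical sketch;
// the real SolrJ method is the authority on which characters get escaped.
class EscapeSketch {

    // Assumed set of characters that are special to common query parsers.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Prefix special characters and whitespace with a backslash.
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints foo\:bar
        System.out.println(escape("foo:bar"));
    }
}
```

The escaped value can then be placed in a query string without the parser treating "foo" as a field name.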
The thing is, I think no user is ever likely to search for "foo:bar".
They will usually just want to search for "bar", but it would be nice to
be able to search specifically for "foo:bar" (or whatever I need to do to
arrange for foo+bar to be ranked higher in the results than just-foo and
just-bar).
It's starting to sound like simply changing my indexing from:
doc.addField("fieldName", "foo:bar");
to this:
doc.addField("fieldName", "foo bar");
... will accomplish my objective. Users who search for "foo bar"
(without quotes) will get documents with "foo" and "bar" together ranked
more highly than others, so that's a win. Those terms may match other
fields which have been copied into the "all" field as well, but that's
not a problem for me.
I have another field containing email addresses, and if I search for e.g.
"gmail.com" (without quotes), I'll get everyone whose email addresses
end with "gmail.com".
Hmm. I just checked, and if I search for "gmail" (without .com) I
don't find them. Maybe without whitespace, those characters (:, .) do
not cause a word-split?
If you are using the StandardTokenizer, one of the things it does is
preserve anything that looks like it might be an email address so they
are not ripped apart into many small terms. That sounds like it is
probably causing this problem. If you actually do want an email address
ripped apart into tiny terms, you may want to craft a different analysis
chain. Getting just the right analysis for your use case is often one
of the most time-consuming parts of setting up a Solr index. Sometimes
you need to use copyField in your schema to have the same input analyzed
in more than one way.
Is there a simple way for me to dump something to show you how the
fields are being analyzed? To short-circuit me saying things like "I
don't know what's really going on under there"? ;)
-chris