On 3/6/2018 10:16 AM, Moncif Aidi wrote:
> I am using Solr to power faceting features for our application.
> I know that SOLR can do free text search but what is the best practice for
> faceting on common terms inside SOLR text fields?
Based on everything below, there might be a little bit of confusion
about exactly what faceting can offer you. It is an enormously powerful
feature, and generally has impressive performance. But there are
limitations, and sometimes performance is not what people expect.
As your other reply mentioned, configuring docValues on any field you
facet on is recommended, for performance and other reasons. But when
you're dealing with a field set up for full-text search, that
recommendation generally has to be ignored, because you can't configure
docValues on a field using the TextField class.
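
To make the docValues point concrete, here is a minimal sketch of a
Schema API "add-field" command for a separate string field that can
carry docValues. The field name "description_facets" is hypothetical,
and the sketch assumes your schema has the stock "string" type backed by
StrField.

```python
import json

def string_facet_field(name, multi_valued=True):
    """Build a Schema API 'add-field' command for a facet-friendly field."""
    return {
        "add-field": {
            "name": name,
            # Assumption: a "string" field type (solr.StrField) exists in the schema.
            "type": "string",
            "indexed": True,
            "stored": True,
            "docValues": True,          # allowed on StrField, not on TextField
            "multiValued": multi_valued,
        }
    }

# "description_facets" is a made-up name used only for illustration.
payload = string_facet_field("description_facets")
print(json.dumps(payload, indent=2))
```

The resulting JSON would be sent as the body of a POST to the
collection's /schema endpoint. The full-text field stays as it is; this
field sits beside it and exists only for faceting.
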
> For example, we have a large blob of text (a description of a property)
> which contains useful text to facet on like 'city', 'formation', 'year',
> 'school', 'skill', ... dozens more like these.
When you have a "large blob of text" there are generally two choices for
the information in a facet.
One is the entirety of the blob, which usually means that every single
document has a unique value. In that case facets are pretty much
useless, because every entry in the facet is probably going to have "1"
for the count, and performance will be terrible.
The other is the individual terms (usually words) in the text. This is
also generally useless for facets, and usually has terrible
performance. Knowing that there are 100 million documents that have
"the" in the field somewhere is not very useful.
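
A toy illustration of both choices, in plain Python rather than Solr.
The three sample descriptions are invented, and the tokenizing is
deliberately crude. It only shows the shape of the counts you would get
back.

```python
from collections import Counter

# Invented sample data standing in for the "large blob of text" field.
docs = [
    "Engineer in Paris, graduated in 2010 from the school of mines",
    "Teacher in Lyon, the formation ended in 2012",
    "Engineer in Paris, the skill set includes Java",
]

# Choice 1: the whole blob is the facet value. Every value is unique.
blob_counts = Counter(docs)

# Choice 2: each term is a facet value. Count documents per term,
# so a term repeated inside one document is only counted once.
term_counts = Counter()
for text in docs:
    terms = set(text.lower().replace(",", "").split())
    term_counts.update(terms)

print(set(blob_counts.values()))        # every blob has a count of 1
print(term_counts["the"], term_counts["in"], term_counts["paris"])
```

The highest counts go to "in" and "the", which appear in all three
documents, while the terms a user might care about are buried among
them.
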
> One obvious solution is to pre-process the data, parse the text, and create
> the facets for each of these key phrases with a boolean yes/no value.
> I'd ideally like to automate this, so I imagine the SOLR free text search
> engine might allow this? e.g. Can I use the free text search engine to
> remove stop words and collect counts of common phrases which we can then
> present to the user?
And now you've mentioned that what you want is *phrases*. How do you
suggest Solr obtain this information? There are no filters included
with Solr that can figure out that one section of a few words is NOT a
phrase that people will be interested in, but another IS.
To get a count of the documents that contain a phrase, you have to have
something that can extract phrases from the big blob of text and add
them to another field, usually of type "string" (class StrField). This
probably has to happen in your indexing pipeline, not in Solr.
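
One way that extraction step might look in an indexing pipeline,
sketched in Python. It assumes you maintain a curated list of phrases
worth faceting on. The vocabulary, the field names "description" and
"description_facets", and both helper functions are hypothetical, not
anything Solr provides.

```python
import re

# Hypothetical curated vocabulary: the phrases you decided are worth faceting on.
VOCABULARY = ["san francisco", "machine learning", "high school", "project management"]

def extract_facet_phrases(text, vocabulary=VOCABULARY):
    """Return the vocabulary phrases that occur in text, matched on word boundaries."""
    # Lowercase and collapse runs of whitespace so "machine  learning" still matches.
    normalized = re.sub(r"\s+", " ", text.lower())
    found = []
    for phrase in vocabulary:
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            found.append(phrase)
    return found

def to_solr_doc(doc_id, description):
    """Build the document to index: the full text plus a multi-valued string field."""
    return {
        "id": doc_id,
        "description": description,                                # full-text field, for searching
        "description_facets": extract_facet_phrases(description),  # string field, for faceting
    }

doc = to_solr_doc("1", "Taught machine  learning at a high school in San Francisco.")
print(doc["description_facets"])
# prints ['san francisco', 'machine learning', 'high school']
```

A fixed vocabulary is the simplest choice. Deciding automatically which
word sequences count as interesting phrases is a harder problem, and
this sketch does not attempt it.
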
Then when you facet on that field, Solr will count the documents for
each value and give you that information.
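
A rough sketch of the query side, assuming the extracted phrases were
indexed into a string field (here given the hypothetical name
"description_facets"). The request parameters are standard faceting
parameters. The sample response is hand-written to mimic the default
JSON layout, where each facet field is a flat list alternating value
and count. It is not real output.

```python
from urllib.parse import urlencode

def facet_query(field, limit=20):
    """Build the query string for a facet-only request (rows=0 returns no documents)."""
    return urlencode({
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,   # skip values that match no documents
        "facet.limit": limit,
        "wt": "json",
    })

def parse_facet_counts(response, field):
    """Turn the flat [value, count, value, count, ...] list into (value, count) pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))

# Hand-written sample shaped like a Solr JSON response, for illustration only.
sample = {
    "facet_counts": {
        "facet_fields": {
            "description_facets": ["high school", 42, "san francisco", 17]
        }
    }
}

print(facet_query("description_facets"))
print(parse_facet_counts(sample, "description_facets"))
# second line prints [('high school', 42), ('san francisco', 17)]
```
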