There are a number of ways this can be accomplished, including as a
preprocessor or a custom update processor, but you may be able to get by
with a tokenized field without term vectors combined with a "keep words"
filter and an index-time synonym filter that uses "replace mode".
So, in addition to storing the text in a normal text field, do a copyField
to a separate text field which has omitTermFreqAndPositions=true since this
field only needs to indicate the presence of a keyword and not its position
or frequency. It would have a custom field type which starts its index
analyzer with a "keep words" token filter (solr.KeepWordFilterFactory) with
a word list file which contains all words used in your synonyms. This
eliminates all words that do not match one of your synonym words.
Then add a synonym filter that operates in replace mode - expand=false and
ignoreCase=true, so that every word in a synonym line is replaced by the
first word on that line - with entries such as:
feline,cat,lion,tiger
See:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
This would index "The cat sat on the tiger's mat" as simply "feline".
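A rough schema.xml sketch of the above, untested and only meant to show how the pieces fit together. The names text_tags, tags, text, keepwords.txt and tag_synonyms.txt are made up for illustration, and the tokenizer, lower-case and possessive filters are my assumptions rather than requirements:

```xml
<!-- Hypothetical field type: keeps only synonym words, then maps each to its "tag" -->
<fieldType name="text_tags" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumption: strip trailing 's so that "tiger's" can match "tiger" -->
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- keepwords.txt lists every word used in the synonym file, one per line -->
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
    <!-- expand="false" is replace mode: each word maps to the first word on its line -->
    <filter class="solr.SynonymFilterFactory" synonyms="tag_synonyms.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
  <analyzer type="query">
    <!-- Queries are expected to name the tag itself, e.g. tags:feline -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Only the presence of a tag matters, so frequencies and positions are omitted -->
<field name="tags" type="text_tags" indexed="true" stored="false"
       omitTermFreqAndPositions="true"/>

<!-- "text" stands in for whatever your normal text field is called -->
<copyField source="text" dest="tags"/>
```

Here tag_synonyms.txt would hold the line "feline,cat,lion,tiger" and keepwords.txt would hold feline, cat, lion and tiger on separate lines. The Field Analysis page in the Solr admin UI is a quick way to check what actually gets indexed for a sample sentence.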
-- Jack Krupansky
-----Original Message-----
From: ben ausden
Sent: Saturday, June 23, 2012 1:21 PM
To: solr-user@lucene.apache.org
Subject: Store matching synonyms only
Hi,
Is it possible to store only the matching synonyms found in a piece of
text?
A use case might be: automatically "tag" documents at index time based on
synonyms.txt, and then retrieve the stored tags at query time.
For example, given the text field:
"The cat sat on the mat"
and a synonyms.txt file containing:
feline,cat,lion,tiger
the resulting tag for this document would be "feline". Multiple synonym
matches would result in multiple tags.
Is this possible with Solr by default, or is the classification/tagging
best done outside Solr before I store the document?
Thanks.