Prasanna,
Wouldn't it be better to use built-in token filters at both index and
query time that convert 'it!' to just 'it'? I believe the
WordDelimiterFilterFactory will do that for you.
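To illustrate why normalizing on both sides works, here is a minimal Python
sketch. It is not the actual WordDelimiterFilterFactory logic; `normalize`
and `analyze` are hypothetical stand-ins that only lowercase a token and
strip non-alphanumeric characters from its ends.

```python
import re

def normalize(token):
    """Hypothetical stand-in for an analysis chain: lowercase the token and
    strip non-alphanumeric characters from both ends."""
    return re.sub(r"^[^0-9A-Za-z]+|[^0-9A-Za-z]+$", "", token).lower()

def analyze(text):
    # Whitespace tokenization, then the same normalization for every token.
    # Tokens that become empty after stripping are dropped.
    return [t for t in (normalize(tok) for tok in text.split()) if t]

# The same analysis runs at index time and at query time, so 'it!' and 'it'
# end up as the same term on both sides.
index_terms = set(analyze("It! The Terror from Beyond Space"))
query_with_bang = analyze("it!")
query_without_bang = analyze("it")
matches = all(term in index_terms for term in query_with_bang)
```

With this approach no synonym entry is needed, because both the title and
the query reduce to the term 'it'.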
Christian
On Oct 5, 2009, at 7:31 PM, Prasanna Ranganathan <pranganat...@netflix.com> wrote:
On 10/5/09 2:46 AM, "Shalin Shekhar Mangar" <shalinman...@gmail.com> wrote:
Alternatively, is there a filter available which takes in a pattern and
produces additional forms of the token depending on the pattern? The use
case I am looking at here is using such a filter to automate synonym
generation. In our application, quite a few of the synonym file entries
match a specific pattern, and having such a filter would make them easier
to manage, I believe. Please correct me if I am missing some unwanted
side effect of this approach.
I do not understand this. TokenFilters are used for things like stemming,
replacing patterns, lowercasing, n-gramming, etc. The synonym filter
inserts additional tokens (synonyms) from a file for each token.
What exactly are you trying to do with synonyms? I guess you could do
stemming etc. with synonyms, but why do you want to do that?
I'll try to explain with an example. Given the term 'it!' in the title, it
should match both 'it' and 'it!' in the query as an exact match. Currently,
this is done by using a synonym entry (and an index-time SynonymFilter) as
follows:
it! => it, it!
Now, the above holds true for all cases where you have a title token of
the form [a-zA-Z]+!. Handling all of those cases requires adding synonyms
manually for each one, which is not easy to manage and does not scale.
I am hoping to do the same by using an index-time filter that takes in a
pattern, like the PatternReplace filter, but adds the newly created token
instead of replacing the original one. Does this make sense? Am I missing
something that would break this approach?
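The idea above can be sketched in a few lines of Python. This is only an
illustration of the behavior being asked about, not an existing Solr
filter: `pattern_add_filter` is a hypothetical name, and the
(term, position increment) pairs are an assumption meant to mimic how
injected synonyms are stacked at the same position as the original token.

```python
import re

def pattern_add_filter(tokens, pattern, replacement):
    """Yield (term, position_increment) pairs.

    Every original token is kept (increment 1). When the pattern rewrites
    the token into something different, the rewritten form is emitted as an
    additional token at the same position (increment 0).
    """
    regex = re.compile(pattern)
    for token in tokens:
        yield token, 1
        variant = regex.sub(replacement, token)
        if variant and variant != token:
            yield variant, 0

# Title tokens of the form letters-followed-by-'!' gain a variant without
# the '!'; all other tokens pass through unchanged.
stream = list(pattern_add_filter(["it!", "lives", "again"],
                                 r"^([A-Za-z]+)!$", r"\1"))
```

Here `stream` contains 'it!' followed by 'it' at the same position, so a
query for either form can match the title without a per-term synonym entry.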
Note that a change in the synonym file requires a re-index of the affected
documents. Also, the synonym map is kept in memory.
What is the overhead incurred in having an additional filter applied
during indexing? Is it strictly CPU only?
Thanks a lot for your valuable input.
Regards,
Prasanna.