Prasanna,

Wouldn't it be better to use the built-in token filters at both index and query time to convert 'it!' to just 'it'? I believe the WordDelimiterFilterFactory will do that for you.

Christian

On Oct 5, 2009, at 7:31 PM, Prasanna Ranganathan <pranganat...@netflix.com> wrote:




On 10/5/09 2:46 AM, "Shalin Shekhar Mangar" <shalinman...@gmail.com> wrote:

Alternatively, is there a filter available which takes in a pattern and produces additional forms of the token depending on the pattern? The use case I am looking at here is using such a filter to automate synonym generation. In our application, quite a few of the synonym file entries match a specific pattern, and I believe having such a filter would make this easier. Please correct me if I am missing some unwanted side effect of this approach.


I do not understand this. TokenFilters are used for things like stemming, replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts
additional tokens (synonyms) from a file for each token.

What exactly are you trying to do with synonyms? I guess you could do
stemming etc with synonyms but why do you want to do that?

I'll try to explain with an example. Given the term 'it!' in the title, it should match both 'it' and 'it!' in the query as an exact match. Currently, this is done by using a synonym entry (and an index-time SynonymFilter) as follows:

it! => it, it!

Now, the above holds true for all cases where you have a title token of the
form [a-zA-Z]+!. Handling all of those cases requires adding synonyms
manually for each case, which is not easy to manage and does not scale.

I am hoping to do the same by using an index-time filter that takes in a pattern, like the PatternReplace filter, but adds the newly created token instead of replacing the original one. Does this make sense? Am I missing something that would break this approach?
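
To make the idea concrete, here is a minimal sketch in plain Java of the behaviour I have in mind. It is not a Solr or Lucene class: the name PatternTokenExpander, its constructor arguments and the expand method are all hypothetical, and it works on a plain list of strings rather than a TokenStream. It only shows the "keep the original, add the rewritten form" logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Solr or Lucene.
class PatternTokenExpander {
    private final Pattern pattern;
    private final String replacement;

    // regex must match the whole token; replacement may use groups such as $1.
    PatternTokenExpander(String regex, String replacement) {
        this.pattern = Pattern.compile(regex);
        this.replacement = replacement;
    }

    // Returns every input token, each followed by its rewritten variant
    // when the token matches the pattern and the variant differs from it.
    List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token); // the original token is always kept
            Matcher m = pattern.matcher(token);
            if (m.matches()) {
                String variant = m.replaceAll(replacement);
                if (!variant.isEmpty() && !variant.equals(token)) {
                    out.add(variant); // extra token, added rather than substituted
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        PatternTokenExpander expander =
            new PatternTokenExpander("([a-zA-Z]+)!", "$1");
        // prints [call, it!, it, now]
        System.out.println(expander.expand(Arrays.asList("call", "it!", "now")));
    }
}
```

In an actual TokenFilter, I assume the added token would be emitted at the same position as the original (position increment 0), the way SynonymFilter stacks its synonyms, so that phrase queries keep working.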


Note that a change in synonym file needs a re-index of the affected
documents. Also, the synonym map is kept in memory.

What is the overhead incurred in having an additional filter applied during
indexing? Is it strictly CPU only?

Thanks a lot for your valuable input.

Regards,

Prasanna.
