Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-12 Thread Chris Hostetter

:  There is a solr.PatternTokenizerFactory class which likely fits the bill in
: this case. The related question I have is this - is it possible to have
: multiple Tokenizers in your analysis chain?

No .. Tokenizers consume CharReaders and produce a TokenStream ... what's 
needed here is a TokenFilter that consumes a TokenStream and produces a 
TokenStream.
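
A minimal sketch of that distinction, in plain Python rather than the Lucene API: one tokenizer stage turns raw text into tokens, and every later stage consumes a token stream and produces a token stream. All names here (`tokenize`, `pattern_variant_filter`, `analyze`) are hypothetical, and the `(term, position_increment)` tuple is only an assumption standing in for Lucene's token attributes. The variant is emitted with a position increment of 0, which is how a synonym-style token would share the position of the original.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Hypothetical token shape: (term text, position increment).
Token = Tuple[str, int]


def tokenize(text: str) -> Iterator[Token]:
    """Tokenizer stage: consumes raw characters and emits tokens."""
    for term in text.split():
        yield (term, 1)


def lowercase_filter(tokens: Iterable[Token]) -> Iterator[Token]:
    """Filter stage: consumes a token stream and emits a token stream."""
    for term, inc in tokens:
        yield (term.lower(), inc)


def pattern_variant_filter(
    tokens: Iterable[Token], pattern: str, replacement: str
) -> Iterator[Token]:
    """Keep each original token and add the pattern-replaced variant.

    The variant gets position increment 0 so that it occupies the same
    position as the original token, the way an injected synonym would.
    """
    regex = re.compile(pattern)
    for term, inc in tokens:
        yield (term, inc)
        variant = regex.sub(replacement, term)
        if variant and variant != term:
            yield (variant, 0)


def analyze(text: str) -> List[Token]:
    """One tokenizer, followed by any number of chained filters."""
    stream = tokenize(text)
    stream = lowercase_filter(stream)
    stream = pattern_variant_filter(stream, r"!$", "")
    return list(stream)
```

With this chain, `analyze("Do It! now")` keeps `it!` and adds `it` at the same position, so either form can match exactly.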





-Hoss



Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-07 Thread Prasanna Ranganathan


On 10/6/09 3:32 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

 
 :  I'll try to explain with an example. Given the term 'it!' in the title, it
 : should match both 'it' and 'it!' in the query as an exact match. Currently,
 : this is done by using a synonym entry (and an index-time SynonymFilter) as
 : follows:
 : 
 :  it! => it, it!
 : 
 :  Now, the above holds true for all cases where you have a title token of the
 : form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
 : manually for each case, which is not easy to manage and does not scale.
 : 
 :  I am hoping to do the same by using an index-time filter that takes in a
 : pattern like the PatternReplace filter and adds the newly created token
 : instead of replacing the original one. Does this make sense? Am I missing
 : something that would break this approach?
 
 something like this would be fairly easy to implement in Lucene, but
 somewhat confusing to try and configure in Solr.  I was going to suggest
 that you use something like...
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="^((.*?)!?)$" replacement="$1 $2" replace="all" />
 
 ..and then have a subsequent filter that splits the tokens on the
 whitespace (or any other special character you could use in the
 replacement) ... but apparently we don't have any built-in filters that
 will just split tokens on a character/pattern for you.  that would also be
 fairly easy to write if someone wants to submit a patch.

 There is a solr.PatternTokenizerFactory class which likely fits the bill in
this case. The related question I have is this - is it possible to have
multiple Tokenizers in your analysis chain?

Prasanna.



Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-06 Thread Chris Hostetter

:  I'll try to explain with an example. Given the term 'it!' in the title, it
: should match both 'it' and 'it!' in the query as an exact match. Currently,
: this is done by using a synonym entry (and an index-time SynonymFilter) as
: follows:
: 
:  it! => it, it!
: 
:  Now, the above holds true for all cases where you have a title token of the
: form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
: manually for each case, which is not easy to manage and does not scale.
: 
:  I am hoping to do the same by using an index-time filter that takes in a
: pattern like the PatternReplace filter and adds the newly created token
: instead of replacing the original one. Does this make sense? Am I missing
: something that would break this approach?

something like this would be fairly easy to implement in Lucene, but 
somewhat confusing to try and configure in Solr.  I was going to suggest 
that you use something like...
 <filter class="solr.PatternReplaceFilterFactory"
         pattern="^((.*?)!?)$" replacement="$1 $2" replace="all" />

..and then have a subsequent filter that splits the tokens on the 
whitespace (or any other special character you could use in the 
replacement) ... but apparently we don't have any built-in filters that 
will just split tokens on a character/pattern for you.  that would also be 
fairly easy to write if someone wants to submit a patch.
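
The replace-then-split idea can be sketched in plain Python (not Solr or Lucene code). It assumes a pattern such as `^((.*?)!?)$`, where group 1 captures the whole token and group 2 captures the token without a trailing `!`. Python writes the replacement as `\1 \2`, where Java regex replacement strings use `$1 $2`. The function name `replace_then_split` is hypothetical.

```python
import re
from typing import List

# Group 1: the whole token. Group 2: the token minus an optional trailing '!'.
# The lazy '.*?' is what lets the optional '!' stay out of group 2.
TOKEN_PATTERN = re.compile(r"^((.*?)!?)$")


def replace_then_split(term: str) -> List[str]:
    """Emulate a pattern replace followed by a split-on-whitespace filter."""
    # Step 1: rewrite the token as "<original> <stripped>".
    combined = TOKEN_PATTERN.sub(r"\1 \2", term)
    # Step 2: split on the separator introduced by the replacement.
    parts = combined.split(" ")
    # Drop duplicates (a token without '!' yields the same text twice),
    # keeping the original order.
    return list(dict.fromkeys(parts))
```

A token without a trailing `!` comes back unchanged, so the extra filter would only add tokens where the pattern actually strips something.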


-Hoss



Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Shalin Shekhar Mangar
On Fri, Oct 2, 2009 at 11:31 PM, Prasanna Ranganathan 
pranganat...@netflix.com wrote:


  Does the PatternReplaceFilter have an option where you can keep the
 original token in addition to the modified token? From what I looked at it
 does not seem to but I want to confirm the same.


No, it does not.


 Alternatively, is there a filter available which takes in a pattern and
 produces additional forms of the token depending on the pattern? The use
 case I am looking at here is using such a filter to automate synonym
 generation. In our application, quite a few of the synonym file entries
 match a specific pattern and having such a filter would make it easier I
 believe. Pl. do correct me in case I am missing some unwanted side-effect
 with this approach.


I do not understand this. TokenFilters are used for things like stemming,
replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts
additional tokens (synonyms) from a file for each token.

What exactly are you trying to do with synonyms? I guess you could do
stemming etc with synonyms but why do you want to do that?


 Continuing on that line, what is the performance hit in having additional
 index-time filters as opposed to using a synonym file with more entries?
 How does the overhead of using a bigger synonym file as opposed to
 additional filters compare?


Note that a change in synonym file needs a re-index of the affected
documents. Also, the synonym map is kept in memory.
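
Since the synonym map is loaded from a file, one possible alternative to maintaining pattern-shaped entries by hand is to generate them offline from the known title tokens. The sketch below is plain Python and only an illustration: the function name `synonym_rules` and the pattern `^([a-zA-Z]+)!$` are assumptions, and the output uses the explicit-mapping form of a Solr synonyms file (`a => b, c`). Regenerating the file would still require re-indexing the affected documents.

```python
import re
from typing import Iterable, List

# Assumed pattern: one or more ASCII letters followed by a single trailing '!'.
BANG_TERM = re.compile(r"^([a-zA-Z]+)!$")


def synonym_rules(terms: Iterable[str]) -> List[str]:
    """Build explicit synonym mappings for every term that ends in '!'."""
    rules = []
    for term in sorted(set(terms)):
        match = BANG_TERM.match(term)
        if match:
            stripped = match.group(1)
            # Map the original term to both the stripped and the original form.
            rules.append(f"{term} => {stripped}, {term}")
    return rules
```

For example, `synonym_rules(["it!", "yahoo!", "it"])` produces one line per matching term, ready to be written to the synonyms file.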

-- 
Regards,
Shalin Shekhar Mangar.


Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Prasanna Ranganathan

 Can someone please give me some pointers to the questions in my earlier
email? Any and every help is much appreciated.

Regards,

Prasanna.


On 10/2/09 11:01 AM, Prasanna Ranganathan pranganat...@netflix.com
wrote:

 
  Does the PatternReplaceFilter have an option where you can keep the original
 token in addition to the modified token? From what I looked at it does not
 seem to but I want to confirm the same.
 
 Alternatively, is there a filter available which takes in a pattern and
 produces additional forms of the token depending on the pattern? The use case
 I am looking at here is using such a filter to automate synonym generation. In
 our application, quite a few of the synonym file entries match a specific
 pattern and having such a filter would make it easier I believe. Pl. do
 correct me in case I am missing some unwanted side-effect with this approach.
 
 Continuing on that line, what is the performance hit in having additional
 index-time filters as opposed to using a synonym file with more entries? How
 does the overhead of using a bigger synonym file as opposed to additional
 filters compare?
 
 Thanks in advance for the help.
 
 Regards,
 
 Prasanna.



Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Prasanna Ranganathan

I just saw the reply from Shalin after sending this email. Kindly excuse.





Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Prasanna Ranganathan



On 10/5/09 2:46 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

 Alternatively, is there a filter available which takes in a pattern and
 produces additional forms of the token depending on the pattern? The use
 case I am looking at here is using such a filter to automate synonym
 generation. In our application, quite a few of the synonym file entries
 match a specific pattern and having such a filter would make it easier I
 believe. Pl. do correct me in case I am missing some unwanted side-effect
 with this approach.
 
 
 I do not understand this. TokenFilters are used for things like stemming,
 replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts
 additional tokens (synonyms) from a file for each token.
 
 What exactly are you trying to do with synonyms? I guess you could do
 stemming etc with synonyms but why do you want to do that?
 
 I'll try to explain with an example. Given the term 'it!' in the title, it
should match both 'it' and 'it!' in the query as an exact match. Currently,
this is done by using a synonym entry (and an index-time SynonymFilter) as
follows:

 it! => it, it!

 Now, the above holds true for all cases where you have a title token of the
form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
manually for each case, which is not easy to manage and does not scale.

 I am hoping to do the same by using an index-time filter that takes in a
pattern like the PatternReplace filter and adds the newly created token
instead of replacing the original one. Does this make sense? Am I missing
something that would break this approach?

 
 Note that a change in synonym file needs a re-index of the affected
 documents. Also, the synonym map is kept in memory.

 What is the overhead incurred in having an additional filter applied during
indexing? Is it strictly CPU only?

 Thanks a lot for your valuable input.

Regards,

Prasanna.



Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Christian Zambrano

Prasanna,

Wouldn't it be better to use built-in token filters at both index and
query time that will convert 'it!' to just 'it'? I believe the
WordDelimiterFilterFactory will do that for you.


Christian




Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Prasanna Ranganathan

On 10/5/09 8:59 PM, Christian Zambrano czamb...@gmail.com wrote:

 
 Wouldn't it be better to use built-in token filters at both index and
 query time that will convert 'it!' to just 'it'? I believe the
 WordDelimiterFilterFactory will do that for you.
 

 We do have a field that uses WordDelimiterFilter but it also uses a Stemmer
and Stopword filter. That field is used for a stemmed match with a nominal
boost. However, the field I am talking about is for an exact match (only
lowercase and synonym filter) with a higher boost than the field with the
WordDelimiterFilter.

Prasanna.