Question regarding snapinstaller
It looks like the snapinstaller script does an atomic remove and replace of the entire solr_home/data_dir/index folder with the contents of the new snapshot before issuing a commit command. I am trying to understand the implications of this.

What happens to queries that arrive in the interval between the moment the existing directory is removed and the moment the commit completes? Doesn't a running instance of Solr need the files in the index folder to serve query results? Or are all the contents of the index folder loaded into memory?

Thanks in advance for any help.

Regards,
Prasanna.
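One piece of background that may bear on this question, assuming Solr runs on a POSIX file system and the current searcher keeps its index files open: on such systems a file whose directory entry has been removed stays readable through any handle that was opened before the removal. The sketch below only demonstrates that file system behaviour. It is not taken from snapinstaller or Solr, and the file name is hypothetical.

```python
import os
import tempfile


def read_after_unlink(payload: bytes) -> bytes:
    """Open a file, delete its directory entry, then read through the open handle.

    Assumes POSIX semantics; on Windows the os.remove call fails while the
    file is still open.
    """
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "segment.dat")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(payload)

    handle = open(path, "rb")  # stands in for a reader opened before the swap
    try:
        os.remove(path)  # the directory entry is gone...
        assert not os.path.exists(path)
        return handle.read()  # ...but the open handle still sees the data
    finally:
        handle.close()
        os.rmdir(directory)


data = read_after_unlink(b"old index contents")
```

Whether this fully explains what queries see during the swap depends on how the running searcher holds its files, which the sketch does not model.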
Re: Question about PatternReplace filter and automatic Synonym generation
On 10/6/09 3:32 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

> : I'll try to explain with an example. Given the term 'it!' in the title, it
> : should match both 'it' and 'it!' in the query as an exact match. Currently,
> : this is done by using a synonym entry (and index time SynonymFilter) as
> : follows:
> :
> : it! => it, it!
> :
> : Now, the above holds true for all cases where you have a title token of the
> : form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
> : manually for each case which is not easy to manage and does not scale.
> :
> : I am hoping to do the same by using an index time filter that takes in a
> : pattern like the PatternReplace filter and adds the newly created token
> : instead of replacing the original one. Does this make sense? Am I missing
> : something that would break this approach?
>
> something like this would be fairly easy to implement in Lucene, but somewhat confusing to try and configure in Solr. I was going to suggest that you use something like...
>
>   <filter class="solr.PatternReplaceFilterFactory" pattern="(^.*)\!?$)" replacement="$1 $2" replace="all" />
>
> ..and then have a subsequent filter that splits the tokens on the whitespace (or any other special character you could use in the replacement) ... but apparently we don't have any built in filters that will just split tokens on a character/pattern for you.
>
> that would also be fairly easy to write if someone wants to submit a patch.

There is a solr.PatternTokenizerFactory class which likely fits the bill in this case. The related question I have is this: is it possible to have multiple Tokenizers in your analysis chain?

Prasanna.
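The replace-then-split idea discussed in this message can be sketched outside of Solr as follows. This is plain Python, not the Lucene or Solr API. The function name, the regular expression, and the replacement string are assumptions chosen for the 'it!' example, and the sketch ignores token positions and offsets, which a real filter would have to maintain.

```python
import re


def replace_then_split(tokens, pattern, replacement):
    """Rewrite each token with a regex, then split the result on whitespace.

    Step 1 mimics a pattern-replace step that emits "base base!" for "base!".
    Step 2 mimics the missing "split tokens on a character" step.
    """
    regex = re.compile(pattern)
    out = []
    for token in tokens:
        rewritten = regex.sub(replacement, token)  # "it!" becomes "it it!"
        out.extend(rewritten.split())              # "it it!" becomes "it", "it!"
    return out


# Tokens without a trailing "!" do not match and pass through unchanged.
tokens = replace_then_split(["it!", "matrix"], r"^([A-Za-z]+)!$", r"\1 \1!")
```

With these inputs the result is the three tokens it, it! and matrix, in that order.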
Re: Question about PatternReplace filter and automatic Synonym generation
Can someone please give me some pointers on the questions in my earlier email? Any and every help is much appreciated.

Regards,
Prasanna.

On 10/2/09 11:01 AM, Prasanna Ranganathan pranganat...@netflix.com wrote:

> Does the PatternReplaceFilter have an option where you can keep the original token in addition to the modified token? From what I looked at it does not seem to, but I want to confirm the same.
>
> Alternatively, is there a filter available which takes in a pattern and produces additional forms of the token depending on the pattern? The use case I am looking at here is using such a filter to automate synonym generication. In our application, quite a few of the synonym file entries match a specific pattern and having such a filter would make it easier, I believe. Please do correct me in case I am missing some unwanted side-effect with this approach.
>
> Continuing on that line, what is the performance hit in having additional index-time filters as opposed to using a synonym file with more entries? How does the overhead of using a bigger synonym file as opposed to additional filters compare?
>
> Thanks in advance for the help.
>
> Regards,
> Prasanna.
Re: Question about PatternReplace filter and automatic Synonym generation
I just saw the reply from Shalin after sending this email. Kindly excuse.

On 10/5/09 5:17 PM, Prasanna Ranganathan pranganat...@netflix.com wrote:

> Can someone please give me some pointers on the questions in my earlier email? Any and every help is much appreciated.
>
> Regards,
> Prasanna.
>
> On 10/2/09 11:01 AM, Prasanna Ranganathan pranganat...@netflix.com wrote:
>
>> Does the PatternReplaceFilter have an option where you can keep the original token in addition to the modified token? From what I looked at it does not seem to, but I want to confirm the same.
>>
>> Alternatively, is there a filter available which takes in a pattern and produces additional forms of the token depending on the pattern? The use case I am looking at here is using such a filter to automate synonym generation. In our application, quite a few of the synonym file entries match a specific pattern and having such a filter would make it easier, I believe. Please do correct me in case I am missing some unwanted side-effect with this approach.
>>
>> Continuing on that line, what is the performance hit in having additional index-time filters as opposed to using a synonym file with more entries? How does the overhead of using a bigger synonym file as opposed to additional filters compare?
>>
>> Thanks in advance for the help.
>>
>> Regards,
>> Prasanna.
Re: Question about PatternReplace filter and automatic Synonym generation
On 10/5/09 2:46 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

>> Alternatively, is there a filter available which takes in a pattern and produces additional forms of the token depending on the pattern? The use case I am looking at here is using such a filter to automate synonym generation. In our application, quite a few of the synonym file entries match a specific pattern and having such a filter would make it easier, I believe. Please do correct me in case I am missing some unwanted side-effect with this approach.
>
> I do not understand this. TokenFilters are used for things like stemming, replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts additional tokens (synonyms) from a file for each token.
>
> What exactly are you trying to do with synonyms? I guess you could do stemming etc with synonyms but why do you want to do that?

I'll try to explain with an example. Given the term 'it!' in the title, it should match both 'it' and 'it!' in the query as an exact match. Currently, this is done by using a synonym entry (and index time SynonymFilter) as follows:

it! => it, it!

Now, the above holds true for all cases where you have a title token of the form [a-zA-Z]*!. Handling all of those cases requires adding synonyms manually for each case, which is not easy to manage and does not scale.

I am hoping to do the same by using an index time filter that takes in a pattern like the PatternReplace filter and adds the newly created token instead of replacing the original one. Does this make sense? Am I missing something that would break this approach?

> Note that a change in synonym file needs a re-index of the affected documents. Also, the synonym map is kept in memory.

What is the overhead incurred in having an additional filter applied during indexing? Is it strictly CPU only?

Thanks a lot for your valuable input.

Regards,
Prasanna.
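The filter proposed in this message, one that keeps the original token and adds a pattern-derived variant, could behave roughly like the sketch below. It is plain Python rather than a Lucene TokenFilter. The function name and the regular expression are assumptions made for the 'it!' example. The variant is emitted with a position increment of 0, which is how stacked synonym tokens are normally represented, so that both forms occupy the same position.

```python
import re
from typing import Iterable, Iterator, Tuple


def add_pattern_variants(
    tokens: Iterable[str], pattern: str, replacement: str
) -> Iterator[Tuple[str, int]]:
    """Yield (term, position_increment) pairs.

    Every original token is kept (increment 1). If the regex changes the
    token, the changed form is added at the same position (increment 0)
    instead of replacing the original.
    """
    regex = re.compile(pattern)
    for token in tokens:
        yield (token, 1)
        variant = regex.sub(replacement, token)
        if variant != token:
            yield (variant, 0)


stream = list(add_pattern_variants(["it!", "rocks"], r"^([A-Za-z]+)!$", r"\1"))
```

Here the stream contains it! and it at the first position and rocks at the second. No synonym map is held in memory in this approach; the cost is one regex evaluation per token at index time.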
Re: Question about PatternReplace filter and automatic Synonym generation
On 10/5/09 8:59 PM, Christian Zambrano czamb...@gmail.com wrote:

> Wouldn't it be better to use built-in token filters at both index and query that will convert 'it!' to just 'it'? I believe the WordDelimiterFilterFactory will do that for you.

We do have a field that uses WordDelimiterFilter, but it also uses a Stemmer and Stopword filter. That field is used for a stemmed match with a nominal boost. However, the field I am talking about is for an exact match (only lowercase and synonym filter) with a higher boost than the field with the WordDelimiterFilter.

Prasanna.
Question about PatternReplace filter and automatic Synonym generation
Does the PatternReplaceFilter have an option where you can keep the original token in addition to the modified token? From what I looked at it does not seem to, but I want to confirm the same.

Alternatively, is there a filter available which takes in a pattern and produces additional forms of the token depending on the pattern? The use case I am looking at here is using such a filter to automate synonym generation. In our application, quite a few of the synonym file entries match a specific pattern and having such a filter would make it easier, I believe. Please do correct me in case I am missing some unwanted side-effect with this approach.

Continuing on that line, what is the performance hit in having additional index-time filters as opposed to using a synonym file with more entries? How does the overhead of using a bigger synonym file as opposed to additional filters compare?

Thanks in advance for the help.

Regards,
Prasanna.
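For comparison with a filter-based approach, pattern-shaped synonym entries can also be generated by a small script instead of being maintained by hand. The sketch below is only an illustration: the pattern (a word followed by a trailing exclamation mark) and the function name are assumptions, and the output uses the explicit "a => b, c" mapping form of a Solr synonyms file.

```python
import re

# Assumed pattern, for illustration only: lowercase letters followed by "!".
TRAILING_BANG = re.compile(r"^([a-z]+)!$")


def synonym_lines(terms):
    """Build 'term => base, term' lines for every term that ends in '!'."""
    lines = []
    for term in sorted(set(t.lower() for t in terms)):
        match = TRAILING_BANG.match(term)
        if match:
            lines.append(f"{term} => {match.group(1)}, {term}")
    return lines


lines = synonym_lines(["It!", "yahoo!", "solr"])
```

The generated lines would still have to be written to the synonym file, and documents re-indexed, whenever the set of matching terms changes.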
Effect of SynonymFilter on Solr document fields
Hi,

I am a newbie to Solr and request you all to kindly excuse any rookie mistakes. I have the following questions:

We use the SynonymFilter on one of the indexed fields. It is specified in the schema.xml as one of the filters (for the analyzer of type index) for that field. I believe this means that any token which matches an entry in the provided synonym file will have all of its forms indexed, provided expand=true. I am able to verify that by using the Solr admin analysis tool.

However, when I use Luke to examine a document in the index which would have synonyms for that particular field, I see only the original value and do not see the additional forms that should have been added by the synonym match for the field in question. I am not sure if I am missing something here. How do I verify the same?

Another related question: the field in question here is not specified as multiValued. However, as I understand it, a synonym match will mean multiple values for that field. I was not able to find any documentation that explains this in detail and would like to know how this particular case impacts the indexing of that field, scoring, etc. How does the behavior of a field having multiple values due to SynonymFilter compare and contrast with the multiValued=true|false flag? What would a synonym match expansion for a field with multiValued=false mean?

Prasanna.
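The distinction that seems relevant to both questions is between the stored value of a field and its indexed terms. As far as I understand it, the stored value is kept verbatim and is what a document view displays, while the analysis chain (including synonym expansion) only affects the terms written to the index. A rough sketch in plain Python, with a hypothetical synonym map and a simplified lowercase-and-whitespace analyzer standing in for the real chain:

```python
def analyze(value, synonyms):
    """Return indexed (position, term) pairs for ONE field value.

    Synonym expansion adds extra terms at the same position; it does not
    create additional field values.
    """
    indexed = []
    for position, token in enumerate(value.lower().split()):
        for term in synonyms.get(token, [token]):
            indexed.append((position, term))
    return indexed


synonyms = {"it!": ["it", "it!"]}  # hypothetical synonym entry
stored = "Love It!"                # stored value: returned as-is with the document
indexed = analyze(stored, synonyms)
```

In this sketch the field still has a single value, "Love It!", whose second position carries two indexed terms. A multiValued field is a different thing: several separate values supplied for the same field in one document. The expanded terms would be expected to show up when browsing the terms of the field rather than in the stored document view.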