Re: Protwords in solr spellchecker
2015-07-10 9:31 GMT+01:00 davidphilip cherian davidphilipcher...@gmail.com:

Hi Kamal,

Not necessarily. You can apply different filters at index time and at query time (note that the order in which the filters are defined matters), so you could add the stop filter at query time only. Define your own custom field type (similar to 'text_en' in schema.xml), perhaps using the standard or whitespace tokenizer followed by the stop filter in the query-time analyzer.

Tip: use the Analysis tool in the Solr admin page to understand the analysis chain of each field type.

HTH
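A minimal sketch of the index-time versus query-time distinction described in this reply, written as plain Java rather than Solr configuration. It is a toy model and not Solr's API: the class `AnalysisChainSketch`, its methods, and the placeholder stop word "badword" are all hypothetical, and the stop filter here is case-sensitive by assumption. It only illustrates two points: the index-time chain can keep every word while the query-time chain drops stop words, and the order of the filters changes the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of an analysis chain. NOT Solr's API; every name here is hypothetical. */
final class AnalysisChainSketch {

    /** Splits on whitespace, roughly what a whitespace tokenizer does. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Lowercases every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    /** Drops tokens that exactly match a stop word (case-sensitive by assumption). */
    static List<String> stop(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) out.add(t);
        }
        return out;
    }

    /** Index-time chain: tokenize and lowercase only, so the index keeps every word. */
    static List<String> analyzeForIndex(String text) {
        return lowerCase(tokenize(text));
    }

    /** Query-time chain: tokenize, lowercase, then apply the stop filter. */
    static List<String> analyzeForQuery(String text, Set<String> stopWords) {
        return stop(lowerCase(tokenize(text)), stopWords);
    }

    /** Same filters in the other order: the stop filter runs before lowercasing. */
    static List<String> analyzeForQueryWrongOrder(String text, Set<String> stopWords) {
        return lowerCase(stop(tokenize(text), stopWords));
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("badword"));
        System.out.println(analyzeForIndex("Some BadWord here"));
        System.out.println(analyzeForQuery("Some BadWord here", stopWords));
        System.out.println(analyzeForQueryWrongOrder("Some BadWord here", stopWords));
    }
}
```

In this sketch the wrongly ordered chain lets "BadWord" through, because the stop filter sees the token before it has been lowercased. In a real schema the same concern applies to the order of the filter elements inside the query analyzer.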
Re: Protwords in solr spellchecker
On Fri, Jul 10, 2015 at 1:03 PM, Kamal Kishore Aggarwal kkroyal@gmail.com wrote:

Hi David,

That is a good suggestion. But if I add these *adult* keywords to the stopwords.txt file, the data containing those keywords will have to be re-indexed. How can I make the change take effect immediately? Is there any other approach you can suggest?
Re: Protwords in solr spellchecker
From: Alessandro Benedetti [mailto:benedetti.ale...@gmail.com]
Sent: Friday, July 10, 2015 7:00 AM

So let's try to analyse the situation from the spellchecking point of view.

First of all, we follow David's suggestion and add the stop filter, configured with our bad words, to the query-time analysis.

*Starting scenario*
We have the protected words in our index, and we still want them to be there.

Let's explore the different kinds of spellchecker available. Where do they take their suggestions from?

*Index Based Spellchecker*
The suggestions come from an auxiliary index.

*Direct Spellchecker*
The suggestions come from the current index.

*File Based Spellchecker*
It takes its spelling suggestions from an external file, so we can curate that file to contain only good words and we are fine. But I guess you would like to use a blacklist, and in this case we would have a whitelist instead.

*Query Time*
At query time *the query is analysed* and a token stream is produced. Then, depending on the implementation, a different lookup is triggered. In the case of the Direct Spellchecker, if I remember well: for each token an FST with all the supported inflections is generated and intersected with the index FST (based on the field), and the suggestion is returned.

Unfortunately, a proper *query-time analysis will not help*. When we analyse the query we have the misspelled word sexe, which is not recognised as a bad word, so the stop filter lets it through. Then the inflections are calculated, the FST is built, and the intersection produces the feared suggestion sex, because that word is in the index. If we can't modify the index, the *Direct Spellchecker is not an option*, if my understanding is correct.

Let's see if the Index Based Spellchecker can help. Unfortunately, in this case too the auxiliary index is built from the analysed form of the original field, which still contains those words.

If you really cannot re-index the content, I would suggest an implementation based on a concept similar to the AnalyzingSuggester in Solr.

Happy to clarify any further questions.
--
Benedetti Alessandro
Visiting card: http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience - 1794 England
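The argument in Alessandro's message, that a query-time stop filter cannot block the suggestion, can be illustrated with a small self-contained sketch. This is not how Lucene or Solr implement the Direct Spellchecker (the message describes an automaton intersection); a brute-force edit-distance scan over an in-memory set of indexed terms is swapped in purely for illustration. The class `DirectSpellcheckSketch`, its methods, the sample terms, and the `maxEdits` parameter are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of "suggest from the index"; hypothetical names, not Solr's API. */
final class DirectSpellcheckSketch {

    /** Classic Levenshtein distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Applies the query-time stop filter to the query token, then proposes every
     * indexed term within maxEdits of it. The stop list is never consulted for
     * the candidates, which is the problem the message describes.
     */
    static List<String> suggest(String queryToken, Set<String> queryStopWords,
                                Set<String> indexedTerms, int maxEdits) {
        List<String> suggestions = new ArrayList<>();
        if (queryStopWords.contains(queryToken)) {
            return suggestions; // the token was removed by the query-time stop filter
        }
        for (String term : new TreeSet<>(indexedTerms)) { // sorted for stable output
            int distance = editDistance(queryToken, term);
            if (distance > 0 && distance <= maxEdits) suggestions.add(term);
        }
        return suggestions;
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>(Arrays.asList("sex", "search", "solr"));
        Set<String> stopWords = new HashSet<>(Arrays.asList("sex"));
        // "sexe" is not itself a stop word, so it passes the filter and is matched
        // against the indexed terms.
        System.out.println(suggest("sexe", stopWords, indexed, 1));
    }
}
```

The stop filter only ever sees the misspelled query token; the candidates come from the indexed terms, so the blocked word is offered unless it is removed from the index or filtered out of the response afterwards.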
RE: Protwords in solr spellchecker
Kamal,

Given the constraint that you cannot re-index the data, your best bet might be simply to filter out the suggestions at the application level, or maybe even have a proxy do it.

Possibly another option: you might be able to extend DirectSolrSpellChecker and override #getSuggestions(), calling the superclass implementation and then post-filtering your stop words out of the response. You'll want to request a few more terms so that you are more likely to get results even if a term or two gets filtered out. You can specify your custom spell checker in solrconfig.xml.

James Dyer
Ingram Content Group
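James's first option, filtering at the application level, might look roughly like the following. This is a sketch under stated assumptions, not Solr code: the class `SuggestionPostFilter` and its `filter` method are hypothetical, and the suggestions are assumed to have already been extracted from the spellcheck response as plain strings in ranked order. The caller is assumed to have asked Solr for more suggestions than it intends to show (for instance with a larger `spellcheck.count`), so that something is left after filtering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Application-side post-filter for spelling suggestions; hypothetical helper. */
final class SuggestionPostFilter {

    /**
     * Keeps the first `wanted` suggestions that are not on the block list.
     * The comparison is case-insensitive; the block list is expected in lowercase.
     */
    static List<String> filter(List<String> rawSuggestions, Set<String> blockedWords, int wanted) {
        List<String> kept = new ArrayList<>();
        for (String suggestion : rawSuggestions) {
            if (kept.size() >= wanted) break;
            if (!blockedWords.contains(suggestion.toLowerCase(Locale.ROOT))) {
                kept.add(suggestion);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Four suggestions were requested although only two will be displayed.
        List<String> fromSolr = Arrays.asList("sex", "sea", "see", "set");
        Set<String> blocked = new HashSet<>(Arrays.asList("sex"));
        System.out.println(filter(fromSolr, blocked, 2));
    }
}
```

Because the block list lives in the application, it can be changed without touching the index, which is the "instant" behaviour asked for earlier in the thread. The same filtering logic could sit inside a custom spell checker subclass instead, as the second option in the message suggests.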
Re: Protwords in solr spellchecker
On Thu, Jul 9, 2015 at 12:09 PM, davidphilip cherian davidphilipcher...@gmail.com wrote:

The best bet is to use solr.StopFilterFactory. Add all such words to stopwords.txt and add this filter to your analyzer.

Reference links:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-StopFilter

HTH

On Thu, Jul 9, 2015 at 11:50 AM, Kamal Kishore Aggarwal kkroyal@gmail.com wrote:

Hi Team,

I am currently working with Java 1.7 and Solr 4.8.1 on Tomcat 7. Is there any feature that lets me prevent certain words from appearing in spell suggestions? For example, if somebody searches for sexe, I do not want Solr to offer sex as the spelling suggestion. How can I stop these kinds of keywords from being shown in suggestions?

Any help is appreciated.

Regards
Kamal Kishore
Solr Beginner