Russian stemmer
Hello, I'm using SnowballPorterFilterFactory with language=Russian. Stemming works fine except for people's names and geographical places. For example, searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове. Are there other stemming plugins for the Russian language that can handle this? If not, what are the options? A simple solution may be to use wildcard queries in Standard mode instead of the DisMaxQueryHandler (Ковров*), but I'd like to avoid that. Thanks.
Re: Russian stemmer
All of your examples stem to ковров:

    assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове",
        new String[] { "ковров", "ковров", "ковров", "ковров" });

Are you sure you enabled this at *both* index and query time?

2010/7/27 Oleg Burlaca o...@burlaca.com
> searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове.

-- Robert Muir rcm...@gmail.com
Re: Russian stemmer
On another look, your problem is ковров itself: it's mapped to ковр. A workaround might be to use the protected words functionality to keep ковров and any other problematic people/geo names as-is.

Separately, in trunk there is an alternative Russian stemmer (RussianLightStemFilterFactory), which might give you fewer problems on average, but I noticed it has the same problem with the example you gave.

On Tue, Jul 27, 2010 at 4:25 AM, Robert Muir rcm...@gmail.com wrote:
> All of your examples stem to ковров [...] Are you sure you enabled this at *both* index and query time?
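The protected-words workaround can be sketched in plain Java. This is an illustration only, not Solr's implementation: the lookup table REPORTED_STEMS stands in for the Snowball stemmer and holds just the stems reported in this thread, and the class and method names are made up for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class ProtectedWordsSketch {
    // Stand-in for the real stemmer: only the outputs reported in this thread.
    static final Map<String, String> REPORTED_STEMS = Map.of(
        "ковров", "ковр",
        "коврова", "ковров",
        "коврову", "ковров",
        "ковровом", "ковров",
        "коврове", "ковров");

    // Hypothetical protected list; in practice it would be loaded from a file.
    static final Set<String> PROTECTED = Set.of("ковров");

    // Lowercase the token, then stem it unless it is on the protected list.
    static String analyze(String token, Set<String> protectedWords) {
        String lower = token.toLowerCase(Locale.ROOT);
        if (protectedWords.contains(lower)) {
            return lower; // left as-is, so it lines up with the inflected forms
        }
        return REPORTED_STEMS.getOrDefault(lower, lower);
    }

    public static void main(String[] args) {
        System.out.println(analyze("Ковров", Set.of()));    // ковр
        System.out.println(analyze("Ковров", PROTECTED));   // ковров
        System.out.println(analyze("Ковровом", PROTECTED)); // ковров
    }
}
```

In Solr the list would normally live in a text file named by the stem filter's protected attribute (for example protected="protwords.txt"); the exact setup depends on the version in use.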
Re: Russian stemmer
Yes, I'm sure I've enabled SnowballPorterFilterFactory at both index and query time, because the search works fine except for names and geo locations. I've noticed that searching for Коврова also shows documents that contain Коврову, Коврове.

Search for Ковров, 7 results: http://www.sova-center.ru/search/?q=%D0%BA%D0%BE%D0%B2%D1%80%D0%BE%D0%B2
Search for Коврова, 26 results: http://www.sova-center.ru/search/?lg=1q=%D0%BA%D0%BE%D0%B2%D1%80%D0%BE%D0%B2%D0%B0

Adding such words to stopwords.txt would be a tedious task, as there are 7 million Russian names :)

Kind Regards, Oleg Burlaca

On Tue, Jul 27, 2010 at 11:35 AM, Robert Muir rcm...@gmail.com wrote:
> a workaround might be to use the protected words functionality to keep ковров and any other problematic people/geo names as-is.
Re: Russian stemmer
A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов.

Немцова: 14 articles http://www.sova-center.ru/search/?lg=1q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0
Немцов: 74 articles http://www.sova-center.ru/search/?lg=1q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
Re: Russian stemmer
Actually the situation with Немцов is ok; I've just checked how Yandex handles Немцов and Немцова: http://nano.yandex.ru/project/inflect/

I think there are two solutions:
a) manually search for both Немцов and Немцова
b) use a wildcard query: Немцов*

Robert, thanks for the RussianLightStemFilterFactory info. I've found this page, http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html, which describes it to some extent. Where can I read more about RussianLightStemFilterFactory?

Regards, Oleg
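Option (a) can be automated on the client side by expanding a name into an explicit OR query. The sketch below is only an assumption about how that might look: the class name is made up, and the endings are the ones that appear in the examples in this thread, not a complete Russian declension table.

```java
import java.util.List;
import java.util.stream.Collectors;

class InflectedQuerySketch {
    // Append each ending to the base form and join the variants with OR.
    static String orQuery(String base, List<String> endings) {
        return endings.stream()
            .map(ending -> base + ending)
            .collect(Collectors.joining(" OR ", "(", ")"));
    }

    public static void main(String[] args) {
        // Endings taken from the Ковров examples: base form, -а, -у, -ом, -е.
        List<String> endings = List.of("", "а", "у", "ом", "е");
        System.out.println(orQuery("Немцов", endings));
        // (Немцов OR Немцова OR Немцову OR Немцовом OR Немцове)
    }
}
```

Unlike the wildcard Немцов*, this matches only the listed forms, but the ending list has to be maintained by hand.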
Re: Russian stemmer
2010/7/27 Oleg Burlaca o...@burlaca.com
> I think there are two solutions: a) manually search for both Немцов and then Немцова b) use wildcard query: Немцов*

Well, here is one idea for a more general solution. The problem with protected words is that you must have a complete list. One idea would be to add a filter that protects from stemming any word that matches a regular expression. In English, maybe someone wants to leave all capitalized words unstemmed to reduce trouble: [A-Z].* In your case, some pattern like [А-Я].*ов might prevent problems.

> Where can I read more about RussianLightStemFilterFactory ?

Here is the link: http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf
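The regular-expression idea can be tried out with java.util.regex before writing any filter. This is a minimal sketch under assumptions: the class name is made up, the pattern uses the Cyrillic range А-Я, and the check has to run before any lowercase filter, because the pattern depends on the capital letter.

```java
import java.util.regex.Pattern;

class PatternProtectSketch {
    // A Cyrillic capital letter, then anything, ending in "ов".
    static final Pattern PROTECT = Pattern.compile("[А-Я].*ов");

    // True when the whole token matches, i.e. it should skip stemming.
    static boolean isProtected(String token) {
        return PROTECT.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isProtected("Ковров"));  // true: stays as-is
        System.out.println(isProtected("Коврова")); // false: goes to the stemmer
        System.out.println(isProtected("ковров"));  // false: already lowercased
    }
}
```

Only the base form is protected here; the inflected forms still go through the stemmer. A capitalized ordinary word at the start of a sentence would also be protected, so the pattern is a heuristic rather than a name detector.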
Re: Russian stemmer
Thanks, Robert, for all your help. The idea of [A-Z].* stopwords is ideal for the English language, but in Russian nouns are inflected: Борис, Борису, Бориса, Борисом. I'll try the RussianLightStemFilterFactory (the article in the PDF says it's more accurate).

Once again thanks, Oleg Burlaca

On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir rcm...@gmail.com wrote:
> in your case then some pattern like [А-Я].*ов might prevent problems.
Re: Russian stemmer
Right, but your problem is that this is the current output:

Ковров -> Ковр
Коврову -> Ковров
Ковровом -> Ковров
Коврове -> Ковров

So if Ковров was simply left alone, all your forms would match...

2010/7/27 Oleg Burlaca o...@burlaca.com
> The idea of [A-Z].* stopwords is ideal for the english language, although in russian nouns are inflected: Борис, Борису, Бориса, Борисом
Re: Russian stemmer
I have studied some Russian. I got the impression from the textbooks that all the exceptions had already been 'found' and were listed in the book. I do know that languages are living, changing organisms, but I would think Russian has got to be more regular than English, even WITH all six cases and 3 genders.

Dennis Gearon

--- On Tue, 7/27/10, Robert Muir rcm...@gmail.com wrote:
From: Robert Muir rcm...@gmail.com
Subject: Re: Russian stemmer
To: solr-user@lucene.apache.org
Date: Tuesday, July 27, 2010, 7:12 AM
> so, if Ковров was simply left alone, all your forms would match...
Re: Problem with Russian stemmer in Solr 1.2
Hi Daniel,

How can I implement a custom Russian factory with various Tokenizers and Filters? Can you provide some code examples?

Regards, Andrew

-- View this message in context: http://www.nabble.com/Problem-with-Russian-stemmer-in-Solr-1.2-tf4049948.html#a11646823 Sent from the Solr - User mailing list archive at Nabble.com.
Re: Problem with Russian stemmer in Solr 1.2
Hi Andrew.

This is an example for one FilterFactory:

    import java.util.Map;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ru.RussianCharsets;
    import org.apache.lucene.analysis.ru.RussianStemFilter;
    import org.apache.solr.analysis.BaseTokenFilterFactory;

    public class RussianStemFilterFactory extends BaseTokenFilterFactory {
        private char[] charset;

        /** @see org.apache.solr.analysis.BaseTokenFilterFactory#init(java.util.Map) */
        @Override
        public void init(Map<String, String> arg0) {
            super.init(arg0);
            // Read the charsetName attribute declared in schema.xml.
            String charsetName = args.get("charsetName");
            this.charset = toCharset(charsetName);
        }

        /** @see org.apache.solr.analysis.TokenFilterFactory#create(org.apache.lucene.analysis.TokenStream) */
        public TokenStream create(TokenStream tokenStream) {
            return new RussianStemFilter(tokenStream, charset);
        }

        // Assumption: the code as posted called getChars() on the name, which does
        // not compile for a String. Mapping the name to Lucene's RussianCharsets
        // tables is one way to obtain the char[] the Russian classes expect.
        static char[] toCharset(String name) {
            if ("KOI8".equals(name)) return RussianCharsets.KOI8;
            if ("CP1251".equals(name)) return RussianCharsets.CP1251;
            return RussianCharsets.UnicodeRussian;
        }
    }

When you call args.get(String) you get a property defined in your schema.xml like this:

    <filter class="myCompany.RussianStemFilterFactory" charsetName="UnicodeRussian"/>

For a tokenizer that prepares the input for your filters:

    import java.io.Reader;
    import java.util.Map;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ru.RussianLetterTokenizer;
    import org.apache.solr.analysis.BaseTokenizerFactory;
    import org.apache.solr.analysis.HTMLStripReader;

    public class HTMLStripRussianLetterTokenizerFactory extends BaseTokenizerFactory {
        private char[] charset;

        /** @see org.apache.solr.analysis.BaseTokenizerFactory#init(java.util.Map) */
        @Override
        public void init(Map<String, String> arg0) {
            super.init(arg0);
            String charsetName = args.get("charsetName");
            // Same name-to-table lookup as in the filter factory above.
            this.charset = RussianStemFilterFactory.toCharset(charsetName);
        }

        /** @see org.apache.solr.analysis.TokenizerFactory#create(Reader) */
        public TokenStream create(Reader reader) {
            // Strip HTML first, then tokenize on Russian letters.
            return new RussianLetterTokenizer(new HTMLStripReader(reader), this.charset);
        }
    }

    <tokenizer class="myCompany.HTMLStripRussianLetterTokenizerFactory" charsetName="UnicodeRussian"/>

I hope it helps.

Regards, Daniel

On 17/7/07 11:34, Andrew Stromnov [EMAIL PROTECTED] wrote:
> How to implement custom Russian factory with various Tokenizers and Filters? Can you provide some code examples?

http://www.bbc.co.uk/ This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated. If you have received it in error, please delete it from your system. Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately. Please note that the BBC monitors e-mails sent or received. Further communication will signify your consent to this.
Re: Problem with Russian stemmer in Solr 1.2
Hi Andrew,

Yes, I saw that. As I'm not knowledgeable in Russian, I had to infer it was adequate. But as you have much more to add to it, it would be interesting if you could contribute that.

The problem is that the Russian analyzer and its filters are all final classes, which doesn't allow an elegant extension. But you can create an analyzer that reuses what is interesting for you (in this case, the stemmer) and customizes the other filters. I would suggest doing that by creating the Solr factories, so you can point to your own files containing your stopwords. Any chance you could contribute this stopwords list?

One of my reasons for not using the RussianAnalyzer directly was that I need a WhitespaceTokenizer that removes HTML code... so I created my factories.

Regards, Daniel

On 9/7/07 19:36, Andrew Stromnov [EMAIL PROTECTED] wrote:
> Hi, Daniel
> The stemmer in RussianAnalyser works as expected. But this analyser doesn't allow any Solr customization: all stopwords are hardcoded, there is no support for a custom tokenizer, and no synonym support. RussianAnalyser is similar to this scheme:
> standard tokenizer
> standard filter factory
> word delimiter filter factory
> lowercase filter factory
> stop filter factory (with hardcoded stopwords)
> russian stem filter
> Regards, Andrew
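The point of building the chain from separate factories, rather than using the fixed RussianAnalyzer, is that each stage becomes a parameter. The sketch below illustrates that with plain Java and no Lucene classes; the class name, the whitespace tokenization, and the toy stemmer that only drops a trailing "и" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

class ConfigurableChainSketch {
    // Tokenize on whitespace, lowercase, drop stopwords, then stem.
    // The stopword set and the stemmer are supplied by the caller,
    // which is what a hardcoded analyzer does not allow.
    static List<String> analyze(String text, Set<String> stopwords,
                                UnaryOperator<String> stemmer) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields one empty piece
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!stopwords.contains(token)) {
                tokens.add(stemmer.apply(token));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Toy stemmer: strips a trailing "и", enough to turn списки into списк.
        UnaryOperator<String> toyStemmer =
            t -> t.endsWith("и") ? t.substring(0, t.length() - 1) : t;
        System.out.println(analyze("и Списки", Set.of("и"), toyStemmer)); // [списк]
    }
}
```

Swapping the stopword set or the stemmer here corresponds to pointing a Solr factory at a different file or class in schema.xml.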
Re: Problem with Russian stemmer in Solr 1.2
Hi Daniel,

Yes, I want to try RussianAnalyzer. How do I enable it in the Solr config?

Thank you.

Daniel Alheiros wrote:
> Hi Andrew.
> I'm using the RussianAnalyzer (part of the Lucene analyzers) and it reduces списки to списк. Do you want to try this other Analyzer?
> Regards, Daniel
>
> On 9/7/07 16:06, Andrew Stromnov [EMAIL PROTECTED] wrote:
>> списки arrondissement turvallisuuden
Re: Problem with Russian stemmer in Solr 1.2
Hi Andrew,

In fact I did it by creating all the factories for Solr, but I think you can use it directly, changing your index like this:

    <fieldtype name="cpstext_russian" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index" class="org.apache.lucene.analysis.ru.RussianAnalyzer"></analyzer>
      <analyzer type="query" class="org.apache.lucene.analysis.ru.RussianAnalyzer"></analyzer>
    </fieldtype>

I've not tested that, but I saw something like this. Please tell me if it works as expected and if it solves your problem (I'm indexing Russian content, and as you seem to be knowledgeable in the Russian language your comments are very useful).

Regards, Daniel

On 9/7/07 18:00, Andrew Stromnov [EMAIL PROTECTED] wrote:
> Yes, I want to try RussianAnalyzer. How to enable it in Solr config?