RE: I18N with SOLR?
You can have only one default search field, but you can use the dismax request handler to search across several fields:

http://wiki.apache.org/solr/DisMaxRequestHandler

Then you can use query field boosting to make one field more significant than another:

    Exact_text^3 text_fr^2 text_en^2 stemmed_text^1.5

-----Original Message-----
From: Dilip.TS [EMAIL PROTECTED]
Sent: Monday, November 19, 2007 07:09
To: solr-user@lucene.apache.org
Subject: RE: I18N with SOLR?

Hello,

Also, can we have something like this, i.e. multiple defaultSearchField entries in schema.xml, when searching for a keyword that combines more than one language?

    <defaultSearchField>text</defaultSearchField>
    <defaultSearchField>text_french</defaultSearchField>
    ...

-----Original Message-----
From: Dilip.TS [EMAIL PROTECTED]
Sent: Monday, November 19, 2007 11:29 AM
To: solr-user@lucene.apache.org
Subject: RE: I18N with SOLR?

Hello,

Does SOLR support searching for a keyword which has a combination of more than 1 language within the same search page?

-----Original Message-----
From: Guglielmo Celata [EMAIL PROTECTED]
Sent: Thursday, November 15, 2007 7:39 PM
To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: I18N with SOLR?

Hi Dilip,

I don't know if this helps, but I have set up a textIt field type in the config/schema.xml file in order to index Italian text. It works pretty well with non-ASCII characters (we do have some accented vowels, even if not as many as the French). It also works with stopwords (and I assume with protwords as well, though I didn't try). I created an italian-stopwords.txt file in the config/ path.

I think the SnowballPorterFilterFactory is a class that is usable by default in Solr, although I remember having read that it's a bit slower than other libraries. But I am no expert.
    <fieldtype name="textIt" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="italian-stopwords.txt" ignoreCase="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
      </analyzer>
    </fieldtype>

On 15/11/2007, Dilip.TS [EMAIL PROTECTED] wrote:

Hi Ed,

Thanks for the help, but I have some queries. I understand that we need stopwords_french.txt and protwords_french.txt files, say for French, in the solr/conf directory. Do we need to write classes like FrenchStopFilterFactory and FrenchPorterFilterFactory for each language, or are these classes built into Solr? I didn't find them in the SOLR/Lucene APIs. I found some classes like org.apache.lucene.analysis.fr.FrenchAnalyzer etc. in lucene-analyzers.jar. Any idea what this class is used for?

Thanks in advance,
Regards
Dilip

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Ed Summers
Sent: Monday, November 12, 2007 7:00 PM
To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: I18N with SOLR?

I'd say yes. Solr supports Unicode and ships with language-specific analyzers, and allows you to provide your own custom analyzers if you need them. This allows you to create different fieldType definitions for the languages you want to support. For example, here is a field type for French text which uses a French stopword list and French stemming.
    <fieldType name="text_french" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.FrenchStopFilterFactory" ignoreCase="true" words="stopwords_french.txt"/>
        <filter class="solr.FrenchPorterFilterFactory" protected="protwords_french.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Then you can create dynamicField definitions that allow you to index and query your documents using the correct field type:

    <dynamicField name="*_french" type="text_french" indexed="true" stored="true"/>

This means that when you index you need to know what language your data is in so that you know what field names to use in your document (e.g. title_french). And at search time you need to know
Re: I18N with SOLR?
On 18-Nov-07, at 9:59 PM, Dilip.TS wrote:

    Hello,
    Does SOLR support searching for a keyword which has a combination of more than 1 language within the same search page?

Sure: Solr is totally language-agnostic.

-Mike
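One way to let a single keyword query hit several language-specific fields, following the dismax suggestion made elsewhere in this thread, might look like the solrconfig.xml fragment below. This is a sketch rather than a tested configuration: the handler name dismax_i18n is hypothetical, the field names and boosts are copied from that suggestion, and solr.DisMaxRequestHandler is the handler class from Solr releases of that period (later releases select dismax through the defType parameter on the standard search handler instead).

```xml
<!-- solrconfig.xml fragment (sketch); "dismax_i18n" is a made-up handler name -->
<requestHandler name="dismax_i18n" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <!-- qf: fields to search, each with an optional ^boost.
         Field names and boosts are taken from the suggestion in this thread
         and must match fields actually defined in schema.xml. -->
    <str name="qf">Exact_text^3 text_fr^2 text_en^2 stemmed_text^1.5</str>
  </lst>
</requestHandler>
```

With a handler registered this way, a request such as /select?qt=dismax_i18n&q=maison would run the keyword against every field listed in qf, so there is no need for more than one defaultSearchField.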
RE: I18N with SOLR?
Hello,

Does SOLR support searching for a keyword which has a combination of more than 1 language within the same search page?
RE: I18N with SOLR?
Hello,

Also, can we have something like this, i.e. multiple defaultSearchField entries in schema.xml, when searching for a keyword that combines more than one language?

    <defaultSearchField>text</defaultSearchField>
    <defaultSearchField>text_french</defaultSearchField>
    ...
Re: I18N with SOLR?
I'd say yes. Solr supports Unicode and ships with language-specific analyzers, and allows you to provide your own custom analyzers if you need them. This allows you to create different fieldType definitions for the languages you want to support. For example, here is a field type for French text which uses a French stopword list and French stemming.

    <fieldType name="text_french" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.FrenchStopFilterFactory" ignoreCase="true" words="stopwords_french.txt"/>
        <filter class="solr.FrenchPorterFilterFactory" protected="protwords_french.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Then you can create dynamicField definitions that allow you to index and query your documents using the correct field type:

    <dynamicField name="*_french" type="text_french" indexed="true" stored="true"/>

This means that when you index you need to know what language your data is in so that you know what field names to use in your document (e.g. title_french). And at search time you need to know what language you are in so you know which fields to search. Most user interfaces are in a single language context, so from the query perspective you'll most likely know the language they want to search in. If you don't know the language context in either case, you could try to guess using something like org.apache.nutch.analysis.lang.LanguageIdentifier.

I hope this helps. We used this technique (without the guessing) quite effectively at the Library of Congress recently for a prototype application that needed to provide search functionality in 7 different languages.

//Ed

On Nov 12, 2007 1:56 AM, Dilip.TS [EMAIL PROTECTED] wrote:

Hello,

Does SOLR support I18N (with multiple language support)?

Thanks in advance.

Regards,
Dilip TS
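A minimal sketch of the per-language field approach described above, under one stated assumption: a follow-up in this thread reports not finding FrenchStopFilterFactory or FrenchPorterFilterFactory in Solr, so this version substitutes the generic solr.StopFilterFactory and solr.SnowballPorterFilterFactory, the same classes used in the Italian field type posted in the thread. The stopword file name comes from the message; language="French" assumes the Snowball French stemmer is available in the installed version.

```xml
<!-- schema.xml fragment (sketch): French field type built from generic factories -->
<fieldType name="text_french" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords_french.txt is expected in the Solr conf directory -->
    <filter class="solr.StopFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- any field whose name ends in _french is analyzed as French text -->
<dynamicField name="*_french" type="text_french" indexed="true" stored="true"/>
```

At index time the language is then carried by the field name. In the update message below, the id field and its value are hypothetical and depend on the schema's unique key:

```xml
<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="title_french">Les Misérables</field>
  </doc>
</add>
```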