RE: I18N with SOLR?

2007-11-19 Thread SDIS M. Beauchamp
You can have only one default search field.

However, you can use the dismax request handler to search across several fields:
http://wiki.apache.org/solr/DisMaxRequestHandler

You can then use query-field boosting (the qf parameter) to make some fields weigh more than others:

Exact_text^3 text_fr^2 text_en^2 stemmed_text^1.5
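
As a rough client-side sketch of the suggestion above, the snippet below builds the request parameters for a dismax query with those boosts. The field names are the ones from the message; the handler name "dismax" passed in qt, the base URL, and the sample query are assumptions that depend on your solrconfig.xml and deployment.

```python
from urllib.parse import urlencode

# Field boosts copied from the message above; the field names come from
# the poster's schema and will differ in yours.
QUERY_FIELDS = {
    "Exact_text": 3,
    "text_fr": 2,
    "text_en": 2,
    "stemmed_text": 1.5,
}


def dismax_params(user_query, boosts=QUERY_FIELDS):
    """Build request parameters for a dismax search over several fields."""
    qf = " ".join(f"{field}^{boost}" for field, boost in boosts.items())
    return {
        "qt": "dismax",  # assumption: a handler is registered under this name
        "q": user_query,
        "qf": qf,
    }


params = dismax_params("maison house")
# Assumed base URL of a local Solr instance; adjust to your deployment.
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

The boosts can also be set once as handler defaults in solrconfig.xml, so that clients only need to send q.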

-Original Message-
From: Dilip.TS [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 19, 2007 07:09
To: solr-user@lucene.apache.org
Subject: RE: I18N with SOLR?

   Hello,

  Can we also have something like this, i.e. multiple
defaultSearchField entries in schema.xml, for searching on a keyword
that combines more than one language?

  <defaultSearchField>text</defaultSearchField>
  <defaultSearchField>text_french</defaultSearchField>...
  -Original Message-
  From: Dilip.TS [mailto:[EMAIL PROTECTED]
  Sent: Monday, November 19, 2007 11:29 AM
  To: solr-user@lucene.apache.org
  Subject: RE: I18N with SOLR?


Hello,
Does Solr support searching for a keyword that combines more than one
language within the same search page?



-Original Message-
From: Guglielmo Celata [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 15, 2007 7:39 PM
To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: I18N with SOLR?


Hi Dilip,
I don't know if this helps, but I have set up a textIt field type in the 
config/schema.xml file in order to index Italian text.
It works pretty well with non-ASCII characters (we do have some accented 
vowels, even if not as many as the French).
It also works with stopwords (and I assume with protwords as well, though 
I didn't try). I created an italian-stopwords.txt file in the config/ path.
I think the SnowballPorterFilterFactory is a class that ships with Solr, 
although I remember having read that it's a bit slower than other libraries.
But I am no expert.


<fieldtype name="textIt" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="italian-stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  </analyzer>
</fieldtype>
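
To make the order of the filters in this chain easier to follow, here is a small pure-Python approximation of part of it: whitespace tokenizing, accent folding, lowercasing, and stopword removal. It is only an illustration, not Solr's implementation. The word-delimiter and stemming steps are left out, the accent folding is a simplification of what ISOLatin1AccentFilterFactory does, and the stopword set is a hypothetical stand-in for italian-stopwords.txt.

```python
import unicodedata

# Hypothetical stand-in for the contents of italian-stopwords.txt.
# Entries are written without accents because folding runs first.
STOPWORDS = {"la", "il", "e", "di"}


def fold_accents(token):
    """Strip combining marks, e.g. 'città' -> 'citta'."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def analyze(text):
    """Apply the simplified chain in the same order as the field type."""
    tokens = text.split()                              # whitespace tokenizer
    tokens = [fold_accents(t) for t in tokens]         # accent filter
    tokens = [t.lower() for t in tokens]               # lowercase filter
    return [t for t in tokens if t not in STOPWORDS]   # stop filter


print(analyze("La città è bella"))
```

Because the stop filter sits after the accent filter, it only ever sees folded tokens, so an accented entry in the stopword file would probably never match; "è" has to be listed as "e".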



On 15/11/2007, Dilip.TS [EMAIL PROTECTED] wrote:
  Hi Ed,
Thanks for the help, but I have some questions.
I understand that we need stopwords_french.txt and protwords_french.txt
files (for French, say) in the solr/conf directory.
Do we need to write classes like FrenchStopFilterFactory and
FrenchPorterFilterFactory for each language, or are these classes built
into Solr? I didn't find them in the Solr/Lucene APIs.
I found some classes like org.apache.lucene.analysis.fr.FrenchAnalyzer
etc. in lucene-analyzers.jar.
Any idea what this class is used for?

  Thanks in advance,

  Regards
  Dilip

Re: I18N with SOLR?

2007-11-19 Thread Mike Klaas

On 18-Nov-07, at 9:59 PM, Dilip.TS wrote:


  Hello,
  Does Solr support searching for a keyword that combines more than one
  language within the same search page?


Sure: Solr is totally language-agnostic.

-Mike


RE: I18N with SOLR?

2007-11-18 Thread Dilip.TS
  Hello,
  Does Solr support searching for a keyword that combines more than one
language within the same search page?



RE: I18N with SOLR?

2007-11-18 Thread Dilip.TS
   Hello,

  Can we also have something like this, i.e. multiple
defaultSearchField entries in schema.xml, for searching on a keyword
that combines more than one language?

  <defaultSearchField>text</defaultSearchField>
  <defaultSearchField>text_french</defaultSearchField>...

Re: I18N with SOLR?

2007-11-12 Thread Ed Summers
I'd say yes. Solr supports Unicode and ships with language-specific
analyzers, and allows you to provide your own custom analyzers if you
need them. This allows you to create different fieldType definitions
for the languages you want to support. For example, here is a field
type for French text that uses a French stopword list and French
stemming.

<fieldType
  name="text_french"
  class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter
      class="solr.FrenchStopFilterFactory"
      ignoreCase="true"
      words="stopwords_french.txt" />
    <filter
      class="solr.FrenchPorterFilterFactory"
      protected="protwords_french.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>

Then you can create dynamicField definitions that allow you to
index and query your documents using the correct field type:

<dynamicField
  name="*_french"
  type="text_french"
  indexed="true"
  stored="true"/>
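
On the indexing side, the naming convention implied by the dynamicField pattern could be applied with a small helper like the one below. This is a sketch under assumptions: only the "_french" suffix appears in this thread, while the language codes, the "_italian" suffix, the helper name, and the sample document are hypothetical.

```python
# Suffixes must match dynamicField patterns in schema.xml. Only "*_french"
# appears in the thread; "_italian" is a hypothetical addition.
LANGUAGE_SUFFIXES = {"fr": "_french", "it": "_italian"}


def localized_document(doc_id, language, fields):
    """Rename each field so it matches the dynamicField for its language."""
    suffix = LANGUAGE_SUFFIXES[language]
    doc = {"id": doc_id}
    for name, value in fields.items():
        doc[name + suffix] = value
    return doc


doc = localized_document("42", "fr", {"title": "Les Misérables"})
print(doc)
```

An unknown language code raises KeyError here, which makes a missing dynamicField mapping visible at index time rather than at search time.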

This means that when you index you need to know what language your
data is in so that you know what field names to use in your document
(e.g. title_french). And at search time you need to know what language
you are in so you know which fields to search.  Most user interfaces
are in a single language context so from the query perspective you'll
most likely know the language they want to search in. If you don't
know the language context in either case you could try to guess using
something like org.apache.nutch.analysis.lang.LanguageIdentifier.
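
At search time, the same convention can be used to pick the fields to query. The sketch below returns the fields for a known interface language and, when the language is unknown, falls back to every configured language. That fallback is an assumption offered as an alternative to guessing the language, not something this message proposes; the codes and suffixes are hypothetical apart from "_french".

```python
# Hypothetical mapping; only the "_french" suffix appears in the thread.
LANGUAGE_SUFFIXES = {"fr": "_french", "it": "_italian"}


def search_fields(base_fields, language=None):
    """Return the field names to query for a UI language, or for all
    configured languages when the language is unknown."""
    if language in LANGUAGE_SUFFIXES:
        suffixes = [LANGUAGE_SUFFIXES[language]]
    else:
        suffixes = sorted(LANGUAGE_SUFFIXES.values())
    return [base + suffix for base in base_fields for suffix in suffixes]


print(search_fields(["title"], "fr"))
print(search_fields(["title"]))
```

The resulting list could be joined with spaces and sent as the qf parameter of a dismax query, which is one way to search several language fields at once.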

I hope this helps. We used this technique (without the guessing) quite
effectively at the Library of Congress recently for a prototype
application that needed to provide search functionality in 7 different
languages.

//Ed

On Nov 12, 2007 1:56 AM, Dilip.TS [EMAIL PROTECTED] wrote:
 Hello,

   Does Solr support I18N (multiple-language support)?
   Thanks in advance.

 Regards,
 Dilip TS