Re: Stemming and other tokenizers

2011-09-20 Thread Pranav Prakash
I have a similar use case, but slightly more flexible and straightforward.
In my case, I have a field 'language' which stores 'en', 'es' or whatever
the language of the document is. The field 'transcript' then stores the
actual content, which is in the language given by the 'language' field.
Following up on the conversation, is this how I am supposed to proceed?

   1. Create one field type in my schema per supported language. This would
   mean creating ~30 fields.
   2. Since I already know the language of my content, I can skip SOLR-1979
   (which is expected in Solr 3.5).

The point where I am unclear is: how do I specify, at index time, which
field to use for a certain language?
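
One possible answer, sketched on the client side: because the language is
already known, the indexing client can pick the target field itself before
sending the document. The 'language' and 'transcript' names come from the
description above; the transcript_<lang> naming, the SUPPORTED set and the
fallback field are assumptions made for illustration, and the schema would
still need a matching field (or dynamic field) per language.

```python
# Assumed set of languages that have their own field in the schema.
SUPPORTED = {"en", "es", "no"}
# Assumed catch-all field for languages without a dedicated analyzer.
FALLBACK_FIELD = "transcript_general"


def route_transcript(doc):
    """Return a copy of doc with 'transcript' moved to a per-language field."""
    routed = dict(doc)
    text = routed.pop("transcript", None)
    if text is None:
        # Nothing to route; leave the document untouched.
        return routed
    lang = str(routed.get("language", "")).strip().lower()
    target = f"transcript_{lang}" if lang in SUPPORTED else FALLBACK_FIELD
    routed[target] = text
    return routed


if __name__ == "__main__":
    print(route_transcript({"id": "42", "language": "es", "transcript": "Hola mundo"}))
```

The routed dictionary is what would be posted to Solr; the original
'language' value stays in place so it can still be used for filtering.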

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny



Re: Stemming and other tokenizers

2011-09-12 Thread Patrick Sauts
I can't create one field per language; that is the problem. But I'll dig into
it following your suggestions.
I'll let you know what I come up with.

Patrick.





Re: Stemming and other tokenizers

2011-09-12 Thread Jan Høydahl
Hi

Everybody else uses a dedicated field per language, so why can't you?
Please explain your use case, and perhaps we can better understand what
you're trying to do and help.
Do you always know the query language in advance?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com




Re: Stemming and other tokenizers

2011-09-12 Thread Manish Bafna
What if a single document has multiple languages?





Re: Stemming and other tokenizers

2011-09-12 Thread Jan Høydahl
Hi,

Do they? Can you explain the layout of the documents? 

There are two ways to handle multilingual docs. If all your docs have both an
English and a Norwegian version, you may split these into two separate
documents, each with the language field filled by LangId, which then also
lets you filter by language. Alternatively, you may assign a title_en and a
title_no field to the same document (adding more fields if you have more
languages per document) and keep it as one document. Your client must then be
adapted to search the language(s) that the user wants.
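
The two layouts can be sketched roughly as follows. This is only an
illustration in Python, not Solr code: the function names, the id scheme
for split documents and the use of title_<lang> in both layouts are
assumptions; only title_en, title_no and the language field come from the
description above.

```python
def split_by_language(doc_id, titles):
    """Layout 1: one document per language version, each with a language field."""
    return [
        {"id": f"{doc_id}_{lang}", "language": lang, f"title_{lang}": text}
        for lang, text in sorted(titles.items())
    ]


def merge_languages(doc_id, titles):
    """Layout 2: a single document carrying one title field per language."""
    merged = {"id": doc_id}
    for lang, text in titles.items():
        merged[f"title_{lang}"] = text
    return merged


def query_fields(languages):
    """Fields the client searches for the languages the user asked for."""
    return " ".join(f"title_{lang}" for lang in languages)


if __name__ == "__main__":
    titles = {"en": "Hello", "no": "Hei"}
    print(split_by_language("7", titles))
    print(merge_languages("7", titles))
    print(query_fields(["en", "no"]))
```

With the first layout a filter on the language field narrows results to one
language; with the second, the client chooses which title_<lang> fields to
search, for example by passing the string from query_fields as the list of
query fields of a dismax-style request.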

If one document has multiple languages within the same field, e.g. a body
with one paragraph in English and the next in Norwegian, then we currently do
not have any capability in Solr to apply different analysis (tokenization,
stemming etc) to each paragraph.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com




Re: Stemming and other tokenizers

2011-09-11 Thread Jan Høydahl
Hi,

You won't be able to detect the language and change the stemmer on the same
field in one go. You need to create one fieldType in your schema per language
you want to use, and then use LanguageIdentification (SOLR-1979) to do the
magic of detecting the language and renaming the field. If you set
langid.override=false, langid.map=true and populate your language field with
the known language, you will probably get the desired effect.
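
To make the intended behaviour concrete, here is a rough Python emulation
of the logic described above. It is not the SOLR-1979 code: the function,
its arguments and the <field>_<lang> naming are assumptions for
illustration, and the exact parameter names should be checked against the
documentation of the Solr release in use.

```python
def identify_and_map(doc, fields, detect, lang_field="language",
                     override=False, do_map=True):
    """Emulate: keep a known language unless override is set, then rename fields.

    detect is any callable taking text and returning a language code; it
    stands in for the real language detector.
    """
    out = dict(doc)
    lang = out.get(lang_field)
    if override or not lang:
        # Only run detection when no language is known, or when forced.
        sample = " ".join(str(out[f]) for f in fields if f in out)
        lang = detect(sample)
        out[lang_field] = lang
    if do_map:
        # Rename e.g. body -> body_en so a per-language fieldType applies.
        for f in fields:
            if f in out:
                out[f"{f}_{lang}"] = out.pop(f)
    return out


if __name__ == "__main__":
    always_english = lambda text: "en"  # placeholder detector
    print(identify_and_map({"id": "1", "language": "no", "body": "Hei"},
                           ["body"], always_english))
```

With override left at False, a document that already carries a language
value skips detection entirely and is only renamed, which is the effect
the settings above are meant to achieve.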

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com




Stemming and other tokenizers

2011-09-09 Thread Patrick Sauts
Hello,

 

I want to implement some kind of AutoStemming that will detect the language
of a field based on a tag at the start of the field, like #en#. My field is
stored on disk, but I don't want this tag to be stored. Is there a way to
keep the tag out of the stored field?

As I understand it, all the filters and the tokenizers interact only with
the indexed field and not the stored one.

Am I wrong?

Is it possible to write such a filter?
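
For illustration only: since analysis applies to the indexed terms while the
stored value is kept as it was submitted, one way to keep the tag out of the
stored field is to strip it before the document reaches Solr. The sketch
below does this in Python on the client side; the function name and the
assumption that the tag is two or three lowercase letters between # signs
are mine, based on the #en# example.

```python
import re

# Assumed tag shape: "#xx#" or "#xxx#" at the very start of the value.
TAG = re.compile(r"^#([a-z]{2,3})#\s*")


def split_language_tag(value, default_lang=None):
    """Return (language, text without the leading tag)."""
    match = TAG.match(value)
    if not match:
        # No tag present: keep the text as is and fall back to the default.
        return default_lang, value
    return match.group(1), value[match.end():]


if __name__ == "__main__":
    print(split_language_tag("#en# my field"))
```

The returned language code can then be used to choose the field or field
type for the cleaned text, and only the cleaned text is sent for storage.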

 

Patrick.