Hi Yonik.

About how to handle the index at query time:

I think that if you don't specify a language, you can return any document
matching the term, regardless of language (if that's possible). Or, if it
suits your solution better, you can define a default language to be used
when none is given explicitly in the query.

So the analyzer has to be able to deal with the case where no specific
language is given (which I think is only acceptable at query time)...
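
To make the two options concrete, here is a minimal sketch, assuming the
parallel-field layout from the quoted message below (title_en, title_fr,
title_de). The function name build_query, the LANGUAGES tuple and the
default_lang parameter are all hypothetical and not part of Solr or Lucene;
the term is assumed to be a single word that is already escaped.

```python
from typing import Optional

# Assumed set of indexed languages; field names follow the
# title_en / title_fr / title_de convention from the quoted message.
LANGUAGES = ("en", "fr", "de")


def build_query(term: str,
                lang: Optional[str] = None,
                default_lang: Optional[str] = None,
                base_field: str = "title") -> str:
    """Return a Lucene-style query string for a single term.

    Hypothetical helper for illustration only, not a Solr API.
    """
    # An explicit language wins; otherwise fall back to the default, if any.
    chosen = lang or default_lang
    if chosen is not None:
        if chosen not in LANGUAGES:
            raise ValueError(f"unsupported language: {chosen}")
        return f"{base_field}_{chosen}:{term}"
    # No language and no default: match the term in any language field.
    clauses = [f"{base_field}_{code}:{term}" for code in LANGUAGES]
    return "(" + " OR ".join(clauses) + ")"
```

With no language given, build_query("paris") searches every language field;
build_query("paris", default_lang="en") restricts the search to title_en.
Note that the "any language" branch is the multiple-field query Yonik asks
about, so this sketch does not avoid that cost.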

Do you think it's doable?

It could be applied to the scenario Kuro explained (documents translated
into different languages) or to my own scenario (different content with
the same structure in different languages).
 

Regards,
Daniel


On 12/6/07 16:30, "Yonik Seeley" <[EMAIL PROTECTED]> wrote:

> On 6/12/07, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:
>> For bi-lingual
>> or tri-lingual search, we can have parallel fields (title_en,
>> title_fr, title_de, for example) but this wouldn't scale well.
> 
> Due to search across multiple fields, or due to increased index size?
> 
>> Lucene and Solr
>> require that the language be known before an Analyzer can be
>> instantiated, and it's the Analyzer that detects the language in my
>> design....  A second obstacle is that the kinds of Filters
>> the Analyzer uses depend on the language, so they must be
>> dynamically changed. This could be done programmatically but
>> it's not easy.  My big hope is that we can work together to
>> come up with some way so that the language detected within
>> the Analyzer can somehow be retrieved and put into the field.
> 
> Something could be done for the indexing side of things, but then how
> do you query?
> Would you be able to do language detection on single word queries, or
> do you apply multiple analyzers and query the same field multiple ways
> (which seems very close to the multiple field approach)?
> 
> Also, would multiple languages in a single field perhaps cause idf skew?
> 
> 50 languages is a lot... perhaps a simple analyzer that could just try
> to break into words and lowercase?
> 
> 
> -Yonik
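
The language-independent fallback suggested at the end of the quoted message
(break into words and lowercase) could look roughly like the sketch below.
It is only an illustration in plain Python, not a Lucene Analyzer, and the
name simple_analyze is hypothetical. It also assumes that words are separated
by whitespace or punctuation, which does not hold for languages such as
Japanese or Chinese.

```python
import re
from typing import List

# For str patterns, \w is Unicode-aware: it matches letters, digits and
# underscore in any script.
_WORD = re.compile(r"\w+")


def simple_analyze(text: str) -> List[str]:
    """Split text into word tokens and lowercase each one."""
    return [token.lower() for token in _WORD.findall(text)]
```

Applying the same function at index time and at query time keeps the terms
comparable without knowing the language, at the cost of losing stemming and
other language-specific filtering.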

