Dude,
There has already been a warning about hijacking this thread. Please act on
it as advised: start your own thread if you want answers to your problem.
Cheers,


Sujatha Arun wrote:
> Thanks Daniel and Erik,
>
> The requirement from the user end is to search only in that particular
> language, not across languages.
>
> Also going forward we will be adding more languages.
>
> So if I have separate fields for each language, then we need to change the
> schema every time, and that will not scale very well.
>
> So there are two options: either use dynamic fields or use multiple cores.
>
> Please advise which is better in terms of scaling and optimum use of existing
> resources (the available RAM is about 4 GB for several instances of Solr).
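>
> For the dynamic-field option, what I have in mind is roughly this in
> schema.xml (just a sketch of the idea; the suffixes and field-type names
> are made up, and each type would still need its own analyzer defined):
>
>   <dynamicField name="*_zh" type="text_zh" indexed="true" stored="true"/>
>   <dynamicField name="*_ja" type="text_ja" indexed="true" stored="true"/>
>
> so documents can carry content_zh, content_ja, etc., and adding a language
> means adding one dynamicField/fieldType pair rather than touching the rest
> of the schema.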
>
> If we use multicore, will it degrade in terms of speed, etc.?
>
> Any pointers will be helpful.
>
> Regards
> Sujatha
>
>
>
>
> On 12/19/08, Julian Davchev <j...@drun.net> wrote:
>> Thanks Erick,
>> I think I will go with different language fields, as I want to give them
>> different stop words, analyzers, etc.
>> I might also consider a schema per language so that scaling is more
>> flexible, as I was already advised, but I guess this will only really make
>> sense if I have more than one server; otherwise all the other data is just
>> duplicated for no reason.
>> We have already decided that the language will be passed with each search,
>> so it makes no sense to search the query across all languages.
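>>
>> Concretely, I mean something like this (the field and value names are just
>> placeholders, assuming per-language fields):
>>
>>   q=content_de:wort&fq=language:de
>>
>> i.e. the client always sends the language, so we know which field and
>> analyzer to hit.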
>>
>> As for CJKAnalyzer, at first look it doesn't seem to be in Solr (I haven't
>> tried yet), and since I am a noob in Java I will check how it's done.
>> I will definitely give it a try.
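>>
>> From the schema docs it looks like a Lucene analyzer class can be
>> referenced directly, so I am hoping something like this works once the
>> contrib jar with CJKAnalyzer is on the classpath (untested on my side):
>>
>>   <fieldType name="text_cjk" class="solr.TextField">
>>     <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
>>   </fieldType>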
>>
>> Thanks a lot for the help.
>>
>> Erick Erickson wrote:
>>> See the CJKAnalyzer for a start; StandardAnalyzer won't
>>> help you much.
>>>
>>> Also, tell us a little more about your requirements. For instance,
>>> if a user submits a query in Japanese, do you want to search
>>> across documents in the other languages too? And will you want
>>> to associate different analyzers with the content from different
>>> languages? You really have two options:
>>>
>>> If you want different analyzers used with the different languages,
>>> you probably have to index the content in different fields. That is,
>>> a Chinese document would have a chinese_content field, a Japanese
>>> document would have a japanese_content field, etc. Now you can
>>> associate a different analyzer with each *_content field.
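>>>
>>> In schema.xml that could look something like the following (the type
>>> names here are only illustrative; each one would declare its own
>>> analyzer chain):
>>>
>>>   <field name="chinese_content" type="text_zh" indexed="true" stored="true"/>
>>>   <field name="japanese_content" type="text_ja" indexed="true" stored="true"/>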
>>>
>>> If the same analyzer would work for all three languages, you
>>> can just index all the content in a "content" field, and if you
>>> need to restrict searching to the language in which the query
>>> was submitted, you could always add a clause on the
>>> language, e.g. AND language:chinese
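>>>
>>> For example, a query submitted in Chinese might be rewritten by your
>>> front end as something like:
>>>
>>>   content:中文 AND language:chinese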
>>>
>>> Hope this helps
>>> Erick
>>>
>>> On Wed, Dec 17, 2008 at 11:15 PM, Sujatha Arun <suja.a...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I am prototyping language search using Solr 1.3. I have 3 fields in the
>>>> schema: id, content, and language.
>>>>
>>>> I am indexing 3 PDF files; the languages are Faroese, Chinese, and
>>>> Japanese.
>>>> I use xpdf to convert the content of the PDFs to text and push the text
>>>> to Solr in the content field.
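>>>>
>>>> (For reference, the steps I use look roughly like this; the file names
>>>> are placeholders:
>>>>
>>>>   pdftotext -enc UTF-8 sample.pdf sample.txt
>>>>   curl http://localhost:8983/solr/update -H "Content-Type: text/xml" \
>>>>     --data-binary @sample-add.xml
>>>>
>>>> where sample-add.xml wraps the extracted text in an <add><doc> envelope
>>>> with the id, content, and language fields.)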
>>>>
>>>> What is the analyzer that I need to use for the above?
>>>>
>>>> By using the default text analyzer and posting this content to Solr, I am
>>>> not getting any results.
>>>>
>>>> Does Solr support stemming for the above languages?
>>>>
>>>> Regards
>>>> Sujatha
>>>>
>>>>
>>>>
>>>>
>>>> On 12/18/08, Feak, Todd <todd.f...@smss.sony.com> wrote:
>>>>
>>>>         
>>>>> Don't forget to consider scaling concerns (if there are any). There are
>>>>> strong differences in the number of searches we receive for each
>>>>> language. We chose to create separate schema and config per language so
>>>>> that we can throw servers at a particular language (or set of languages)
>>>>> if we needed to. We see 2 orders of magnitude difference between our
>>>>> most popular language and our least popular.
>>>>>
>>>>> -Todd Feak
>>>>>
>>>>> -----Original Message-----
>>>>> From: Julian Davchev [mailto:j...@drun.net]
>>>>> Sent: Wednesday, December 17, 2008 11:31 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: looking for multilanguage indexing best practice/hint
>>>>>
>>>>> Hi,
>>>>> From my study of Solr and Lucene so far, it seems that I will use a
>>>>> single schema... at least I don't see a scenario where I'd need more
>>>>> than that. So the question is how to approach multilanguage indexing and
>>>>> multilanguage searching. Will it really make sense to just search a
>>>>> word, or should I rather supply a language parameter to the search as
>>>>> well?
>>>>>
>>>>> I see there are filters such as solr.ISOLatin1AccentFilterFactory and
>>>>> solr.SnowballPorterFilterFactory, and I have already been advised on
>>>>> them, but I guess my question is more one of best practice.
>>>>>
>>>>> So the solution I see is using copyField so that I have the same field
>>>>> in different languages, each using a distinct filter.
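>>>>>
>>>>> Something like this is what I have in mind (the field and type names
>>>>> are invented, just to illustrate):
>>>>>
>>>>>   <field name="content_de" type="text_de" indexed="true" stored="false"/>
>>>>>   <copyField source="content" dest="content_de"/>
>>>>>
>>>>> where text_de uses solr.SnowballPorterFilterFactory with
>>>>> language="German".
>>>>>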
>>>>> Cheers
>>>>>
