Re: Multi-language solr1.3 what would you reckon?

John E. McBride Mon, 13 Oct 2008 07:24:30 -0700

In your schema you define each field as follows:

<fieldtype name="text_it" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
</analyzer>
</fieldtype>

etc

However, you have not defined the query filters. If you do not do this, you will not get any matches for searches in different languages.

For example, in English, if you index the sentence "the joyful boy played tennis", it would typically get stored as "joy boy play tennis" due to the analysis filters. If you then made a query for "joyful" without applying the same filters on the query side, you would get no matches, because the index only holds the stem "joy".

You will also want to get some multilingual stop word lists from the Snowball website, e.g. http://snowball.tartarus.org/algorithms/german/stop.txt.
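As a rough sketch of what that could look like for the Italian field type, the block below spells out separate index and query analyzers and adds a stop filter. The file name stopwords_it.txt is an assumption (a cleaned-up copy of the Snowball Italian list, one word per line), and the filter order is only one reasonable choice. As far as I know, an analyzer element with no type attribute is applied at both index and query time, so the explicit split matters most when the two chains need to differ.

```xml
<fieldtype name="text_it" class="solr.TextField">
  <!-- Applied when documents are indexed. -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <!-- stopwords_it.txt is an assumed file name, one stop word per line. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_it.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  </analyzer>
  <!-- Applied to query text; same chain, so query terms stem to the indexed form. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_it.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  </analyzer>
</fieldtype>
```

The Snowball stop lists carry `|` comments, so they would probably need stripping down to bare words before Solr 1.3 can read them.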


sunnyfr wrote:

What is the problem with the way I've done it? Does that mean that some documents, depending on their language, won't be handled by the search?
There are too many languages. The application will be for video;
we will manage around 10 languages, but in our database we have around 25.
Should I create a core text and others like text_en, text_fr, text_es, and
should all the videos which are not in a language managed by the search engine
be stored in text?

Because even if users are on the English website, they should be able to find
French videos if they enter a French word, e.g. "chien" for "dog".
I don't know if I'm clear?

And in that case, should text handle all the other languages which are not managed
in the other cores?
Thanks


John E. McBride wrote:
Well, it's the section shown below which would change from geography to geography.
Parameterise the EnglishPorterFilterFactory and protwords.
You could introduce logic in the front end which, if the number of results is zero, makes a call to the English-language core, but does that make logical sense? Why would a search in the Italian language bring up anything in the English index?
I think you need to explain your application in a little more detail.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
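
For instance, in a German core's schema.xml the same text type might keep everything except the language-specific pieces. This is only a minimal sketch: it assumes the core ships its own German stopwords.txt, and it swaps in SnowballPorterFilterFactory for the English stemmer. I have left protwords out because I am not sure the Snowball factory accepts a protected attribute in 1.3.

```xml
<!-- Sketch: the language-specific lines of the "text" type in a German core. -->
<!-- stopwords.txt is assumed to be that core's own German list. -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German"/>
```

The same lines would go in both the index and the query analyzer, so that queries are stemmed the same way as the stored terms.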

sunnyfr wrote:
Hi,

Thanks guys for your answers, but I don't think I can use multi-core with one
core per language. For example, if somebody connects from Italy and there are
not that many Italian books, then by default I will show the few Italian books
but all the English ones as well.
Do you have an example? I'm quite lost about it.

John E. McBride wrote:
Fairly nebulous requirements, but I was recently involved in a multilingual search platform.
The approach, translated to Solr 1.3, would be to use multicore: one core per geography. Then a schema.xml per core, each with a different language in the Porter algorithm, stopwords etc., taken from Snowball.
Then on the German front end you make requests to the de core, and on the English front end you make requests to the English core.
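
A multicore layout along those lines could be declared in solr.xml roughly as follows. The core names and directory names are assumptions chosen for illustration, not taken from anyone's actual setup.

```xml
<!-- solr.xml in the Solr home directory: one core per geography (names assumed). -->
<solr persistent="false">
  <cores adminPath="/admin/cores">
    <!-- Each instanceDir holds its own conf/schema.xml with that language's analysis chain. -->
    <core name="en" instanceDir="en"/>
    <core name="de" instanceDir="de"/>
    <core name="it" instanceDir="it"/>
  </cores>
</solr>
```

Assuming the webapp is deployed under /solr, the German front end would then query a path like /solr/de/select?q=... and the English one /solr/en/select?q=...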
This is much simpler than storing every language in the one index; for example, German queries will need to be run through the German query filters etc. If you have all languages in one schema, then you will have to do some front-end logic to map the query to the correct field.
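
To illustrate the single-schema alternative, the fields might be declared per language as in the sketch below. The field names are hypothetical, and the field types text_en, text_fr and text_es are assumed to be defined elsewhere in the same schema.xml.

```xml
<!-- Sketch: one index, one field per language (field names are hypothetical). -->
<field name="title_en" type="text_en" indexed="true" stored="true"/>
<field name="title_fr" type="text_fr" indexed="true" stored="true"/>
<field name="title_es" type="text_es" indexed="true" stored="true"/>
```

The front end would then have to pick the field itself, for example sending q=title_fr:chien from the French site, which is the mapping logic mentioned above.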
You have failed to consider internationalisation of the query side of the process: your field types merely have analysis filters. Additionally, if the data source for each geography is different, it makes sense to separate the indexes and subsequently the ingestion mechanisms and schedules.
Just a few thoughts.

John

sunnyfr wrote:
Hi,

I would like to set up a multi-language search engine properly,
and I would like your advice about what I have done.

Solr1.3
tomcat55
http://www.nabble.com/file/p19954805/schema.xml schema.xml
Thanks a lot,
