Re: how to Index and Search non-English Text in Solr

Mohammad Shariq Thu, 09 Jun 2011 02:27:28 -0700

Can I specify multiple languages in the filter tags in schema.xml, like below?


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="1" catenateWords="1" catenateNumbers="1"
              catenateAll="0" splitOnCaseChange="1"/>

      <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Chinese"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <tokenizer class="solr.CJKTokenizerFactory"/>

      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>
   </analyzer>
</fieldType>


On 8 June 2011 18:47, Erick Erickson <erickerick...@gmail.com> wrote:

> This page is a handy reference for individual languages...
> http://wiki.apache.org/solr/LanguageAnalysis
>
> But the usual approach, especially for Chinese/Japanese/Korean
> (CJK) is to index the content in different fields with language-specific
> analyzers then spread your search across the language-specific
> fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords
> particularly give "surprising" results if you put words from different
> languages in the same field.
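
A minimal sketch of the per-language-field approach Erick describes, written as a schema.xml fragment. The type and field names (text_en, text_nl, text_cjk, news_en, news_nl, news_cjk) are made up for illustration, and the filter chains are kept deliberately short; the stop word, synonym and word-delimiter filters from the existing "text" type could be added per language. It assumes solr.CJKTokenizerFactory is available in the Solr version in use.

```xml
<!-- Sketch only: all type and field names below are hypothetical. -->
<types>
  <!-- English: one tokenizer, then lowercasing and the English stemmer -->
  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>

  <!-- Dutch: same chain, but with the Dutch stemmer only -->
  <fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
    </analyzer>
  </fieldType>

  <!-- Chinese/Japanese/Korean: CJK tokenizer, no stemmer -->
  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>
</types>

<fields>
  <!-- One field per language, each bound to its own analysis chain -->
  <field name="news_en"  type="text_en"  indexed="true" stored="false" required="false"/>
  <field name="news_nl"  type="text_nl"  indexed="true" stored="false" required="false"/>
  <field name="news_cjk" type="text_cjk" indexed="true" stored="false" required="false"/>
</fields>
```

With something like this, each article would be sent to the field matching its language at index time, and a query could be spread across the fields, for example with the dismax handler's qf parameter (qf=news_en news_nl news_cjk).
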
>
> Best
> Erick
>
> On Wed, Jun 8, 2011 at 8:34 AM, Mohammad Shariq <shariqn...@gmail.com>
> wrote:
> > Hi,
> > I have set up Solr (solr-1.4 on Ubuntu 10.10) for indexing news articles in
> > English, but my requirements now extend to indexing news in other languages
> > too.
> >
> > This is how my schema looks:
> > <field name="news" type="text" indexed="true" stored="false"
> > required="false"/>
> >
> >
> > And the "text" field type in schema.xml looks like this:
> >
> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >    <analyzer type="index">
> >       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >       <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt" enablePositionIncrements="true"/>
> >       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> > catenateAll="0" splitOnCaseChange="1"/>
> >       <filter class="solr.LowerCaseFilterFactory"/>
> >       <filter class="solr.SnowballPorterFilterFactory" language="English"
> > protected="protwords.txt"/>
> >    </analyzer>
> >    <analyzer type="query">
> >       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > ignoreCase="true" expand="true"/>
> >       <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt" enablePositionIncrements="true"/>
> >       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> > generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> > catenateAll="0" splitOnCaseChange="1"/>
> >       <filter class="solr.LowerCaseFilterFactory"/>
> >       <filter class="solr.SnowballPorterFilterFactory" language="English"
> > protected="protwords.txt"/>
> >    </analyzer>
> > </fieldType>
> >
> >
> > My problem is:
> > Now I want to index news articles in other languages too, e.g.
> > Chinese and Japanese.
> > How can I modify my text field so that I can index news in other
> > languages too and make it searchable?
> >
> > Thanks
> > Shariq
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/how-to-Index-and-Search-non-Eglish-Text-in-solr-tp3038851p3038851.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>



-- 
Thanks and Regards
Mohammad Shariq
