Re: Inverse English and digits in Arabic Text

Alexandre Rafalovitch Mon, 07 Sep 2020 04:11:10 -0700

> Doc in Arabic with some English - English text is inverted (for example,
> "gro.ehcapa.www"), which makes searching by keyword impossible.


What, very specifically, do you mean by that? How do you see the inversion?

If that's within some sort of web UI, then you are probably seeing HTML
bidi (bidirectional LTR/RTL) presentation issues.

And if you are seeing it in the Cloudera UI, then the question may be one for
their forum.

One way to test is to put English text in brackets, "(www.apache.org)",
within the Arabic flow. If you see your issue again but the brackets also come
out wrong, "((gro.....", this is most likely a bidi presentation issue, caused
by the bidi algorithm or by an HTML direction attribute set to RTL.

It could be something else, but that would be a starting point.
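
To tell a display problem from a data problem, it may also help to look at the raw field value outside any UI. The sketch below is only an illustration, not something taken from Solr or Cloudera: the helper names `describe` and `classify` and the two sample strings are hypothetical. It lists the characters in stored (logical) order with their Unicode bidi class, and checks whether the Latin run is intact or reversed in the data itself.

```python
import unicodedata


def describe(text):
    """List (char, code point, bidi class) for each character in stored order."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


def classify(stored, expected="www.apache.org"):
    """Guess whether the Latin run is intact or reversed in the stored data."""
    if expected in stored:
        return "intact in the data: a reversed display is likely a bidi presentation issue"
    if expected[::-1] in stored:
        return "reversed in the data: the problem is likely in extraction or indexing"
    return "expected text not found"


# Hypothetical sample values; replace them with the raw field value returned
# by a query. The Arabic word is written with escapes so that the source file
# itself is not affected by bidi rendering.
arabic_word = "\u0645\u0648\u0642\u0639"
stored_ok = arabic_word + " (www.apache.org)"
stored_reversed = arabic_word + " (gro.ehcapa.www)"

for value in (stored_ok, stored_reversed):
    print(classify(value))
    for ch, code_point, bidi_class in describe(value):
        print(f"  {code_point}  {bidi_class:3}  {ch!r}")
```

If the value comes back intact here but looks reversed in the browser, the index is probably fine and the presentation layer is the place to look. If it is reversed in the data, the bidi explanation does not apply.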

Regards,
    Alex


On Mon., Sep. 7, 2020, 5:54 a.m., <ad...@ukr.net> wrote:

> Hi,
>
> Could you please help me resolve an issue? I upload and index several
> documents in English and in Arabic to Solr, and I use the following field
> type for the Arabic language:
>   <fieldType name="text" class="solr.TextField" positionIncrementGap="50">
>     <analyzer type="index">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       <filter class="solr.ArabicNormalizationFilterFactory"/>
>       <filter class="solr.ArabicStemFilterFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       <filter class="solr.ArabicNormalizationFilterFactory"/>
>       <filter class="solr.ArabicStemFilterFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> There are two environments:
> Local machine:
>                 - Solr version: 4.2
>                 - Windows version: 10
>
> DEV env:
>                 - Solr version: 4.1, as part of the Cloudera suite
>                 - Linux kernel version: 3.10.0-862
>
> The issue appears when uploading documents:
> Local machine:
>                 - Doc in English with English words only - ok (for example,
> "www.apache.org")
>                 - Doc in Arabic with some English words - ok (for example,
> "www.apache.org")
>
> DEV env:
>                 - Doc in English with English words only - ok (for example,
> "www.apache.org")
>                 - Doc in Arabic with some English - English text is inverted
> (for example, "gro.ehcapa.www"), which makes searching by keyword impossible.
>
> Please advise whether this is fixable, and how.
>
> Thank you in advance!
>
