Re: Inverse English an digits in Arabic Text

2020-09-08 Thread adeq8

Thank you for the support.

I upload the PDF file page by page. In this case the left-to-right (LTR) or
right-to-left (RTL) reading direction applies to the whole document, not to
each specific text block (separately for Arabic, separately for English).

I can see the same behavior in the output of both the /select and the /browse
call.

I am almost sure the problem is during upload.

But adding this to the 
   and later to another analyzer does not change the 
result.




Re: Inverse English an digits in Arabic Text

2020-09-08 Thread Alexandre Rafalovitch
If you are uploading a PDF, then you must be doing it via Tika or via
an extract handler (which uses Tika under the covers).

Try getting a standalone Tika of the same version and see what it
outputs. Perhaps there is something in those specific PDF pages that
confuses Tika. For example, if the PDF used a different font for the
English text, Adobe may have encoded each letter individually and
therefore broken the flow. PDF is not a content format but a presentation
format. These things happen.
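
A minimal sketch of that standalone check, assuming a tika-app jar of the
matching version is available locally and Java is on the PATH. The jar and
PDF file names are placeholders, not taken from this thread. The helper only
reports whether a known string is missing from the extracted text while its
character-reversed form is present:

```python
import subprocess


def tika_text_command(tika_jar: str, pdf_path: str) -> list[str]:
    """Build the tika-app command line for plain-text output in UTF-8."""
    return ["java", "-jar", tika_jar, "--text", "--encoding=UTF-8", pdf_path]


def extract_text(tika_jar: str, pdf_path: str) -> str:
    """Run standalone Tika and return whatever text it extracts."""
    result = subprocess.run(
        tika_text_command(tika_jar, pdf_path),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def contains_reversed(text: str, expected: str) -> bool:
    """True if `expected` is absent but its reversed form is present."""
    return expected not in text and expected[::-1] in text


if __name__ == "__main__":
    # Placeholder file names; substitute the real jar and one PDF page.
    text = extract_text("tika-app.jar", "page1.pdf")
    print("reversed in Tika output:", contains_reversed(text, "www.apache.org"))
```

If this reports True on the DEV machine but False locally, the inversion
happens during extraction, before the text reaches any Solr analyzer.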

Regards,
   Alex



Re: Inverse English an digits in Arabic Text

2020-09-07 Thread Erick Erickson
A quick test would be to send some simple queries by curl
rather than the browser; that’ll avoid any rendering issues.
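
The same idea can be sketched in Python instead of curl. This is only an
illustration: the base URL, collection name and field name used below are
assumptions, not values from this thread. Printing the code points of the
returned value shows the stored character order without any bidi rendering
getting in the way:

```python
import json
import urllib.parse
import urllib.request


def select_url(base_url: str, query: str, field: str) -> str:
    """Build a /select URL that returns only `field` as JSON."""
    params = urllib.parse.urlencode(
        {"q": query, "fl": field, "wt": "json", "rows": 5}
    )
    return f"{base_url}/select?{params}"


def code_points(text: str) -> list[str]:
    """List each character as U+XXXX, in stored (logical) order."""
    return [f"U+{ord(ch):04X}" for ch in text]


def fetch_field_values(base_url: str, query: str, field: str) -> list:
    """Query Solr and return the raw value of `field` for each document."""
    with urllib.request.urlopen(select_url(base_url, query, field)) as resp:
        body = json.load(resp)
    return [doc.get(field) for doc in body["response"]["docs"]]


if __name__ == "__main__":
    # Hypothetical core URL and field name; adjust to the real setup.
    base = "http://localhost:8983/solr/collection1"
    for value in fetch_field_values(base, "*:*", "content"):
        print(code_points(str(value))[:40])
```

If the code points for the English text come back in "www" order, the stored
value is fine and only the display is reversed.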

Second, take a look at the admin 
UI>>pick_a_collection_from_the_dropdown>>analysis 
page and look at the terms in the field in question. Do they look “ok”? You’re
looking at what’s actually indexed at that point. The Terms Component lets
you look at the indexed terms more powerfully too:
https://lucene.apache.org/solr/guide/7_3/the-terms-component.html

Finally, since it’s OK in one environment and not in another, it’s very likely
not an issue with Solr itself, but something different about the environments,
especially the indexing process. I doubt it’s a difference between Solr 4.1
and 4.2…

Best,
Erick




Re: Inverse English an digits in Arabic Text

2020-09-07 Thread Alexandre Rafalovitch
> Doc in Arabic with some English - English text is inverted (for example,
> "gro.echapa.www"), which makes searching by keywords impossible.

What very specifically do you mean by that? How do you see the inversion?

If that's within some sort of web UI, then you are probably seeing an HTML
bidi (bidirectional LTR/RTL) presentation issue.

And if you are seeing it in the Cloudera UI, then the question may be for their
forum.

One way to test is to put English text in brackets, "(www.apache.org)",
within the Arabic flow. If you see your issue again but the brackets also get
weird, "((gro.", this is most likely a bidi presentation issue, with the
algorithm or an HTML attribute set to RTL.

It could be something else though, but that would be a starting point.
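
To separate a presentation problem from a storage problem, the stored string
can be inspected directly. This is a rough sketch using only the Python
standard library. The sample sentence and the regular expression for "Latin
runs" are illustrative assumptions:

```python
import re
import unicodedata

# Runs that start with an ASCII letter or digit and continue with URL-like
# characters. This is a deliberately simple, illustrative pattern.
LATIN_RUN = re.compile(r"[A-Za-z0-9][A-Za-z0-9._:/-]*")


def bidi_classes(text: str) -> list[tuple[str, str]]:
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


def stored_latin_runs(text: str) -> list[str]:
    """Return the Latin/digit runs exactly as stored (logical order)."""
    return LATIN_RUN.findall(text)


if __name__ == "__main__":
    # Arabic text with a bracketed URL in the middle, as in the test above.
    sample = "\u0632\u0648\u0631\u0648\u0627 (www.apache.org) \u0627\u0644\u064a\u0648\u0645"
    print(stored_latin_runs(sample))
    print(bidi_classes(sample)[:8])
```

If the run is stored as "www.apache.org" but shows up reversed on screen, the
data is fine and the display layer is at fault. If the stored run itself is
reversed, the problem is upstream of the UI.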

Regards,
Alex


On Mon., Sep. 7, 2020, 5:54 a.m. ,  wrote:

> Hi,
>
> Could you please help to resolve an issue? I upload/index several documents
> in English and Arabic to SOLR. In addition, I use a handler for the
> Arabic language:
>   
>
> 
>  words="stopwords.txt" enablePositionIncrements="true" />
>   class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   class="solr.ArabicNormalizationFilterFactory"/>
> 
> 
>
>   
>   
> 
>  words="stopwords.txt" enablePositionIncrements="true" />
>  ignoreCase="true" expand="true"/>
>   class="solr.RemoveDuplicatesTokenFilterFactory"/>
>class="solr.ArabicNormalizationFilterFactory"/>
> 
> 
>
>   
>
> There are two environments:
> Local machine:
>   - SOLR version: 4.2
>   - Windows version: 10
>
> DEV env:
>   - SOLR version: 4.1 as part of the Cloudera suite
>   - Linux kernel version: 3.10.0-862
>
> The issue appears when uploading documents:
> Local machine:
>   - Doc in English with English words only - ok (for example,
>     "www.apache.org")
>   - Doc in Arabic with some English words - ok (for example,
>     "www.apache.org")
>
> DEV env:
>   - Doc in English with English words only - ok (for example,
>     "www.apache.org")
>   - Doc in Arabic with some English - English text is inverted (for
>     example, "gro.echapa.www"), which makes searching by keywords
>     impossible.
>
> Please advise whether this is fixable, and how.
>
> Thank you in advance!
>