Re: Arabic Text

Erik Scholtz, ArgonSoft GmbH Wed, 27 Jan 2010 00:07:32 -0800

Matt,

I hope this is the information you need (from the README):

You get text that has the correct characters, but in the wrong order. This might be because you have not enabled sorting. The text in PDF files is stored in chunks and the chunks do not need to be stored in the order that they are displayed on a page. By default, PDFBox does not sort the text. Also, if you have text in a language that reads right to left (such as Arabic or Hebrew), make sure you have the ICU4J jar file in your classpath. This library is needed to properly handle right to left text.
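
To make the two points above concrete, here is a rough sketch in plain Java, not PDFBox code. The `Chunk` class, `sortByPosition` and `needsBidi` are hypothetical names made up for illustration, and I am assuming a coordinate system where y grows downward. It shows why unsorted chunks come out in storage order, and how the JDK's `java.text.Bidi` can tell you whether a string contains right-to-left characters at all. In PDFBox itself I believe sorting is switched on with `PDFTextStripper.setSortByPosition(true)`, but please check that against the version you are running.

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ChunkOrderSketch {

    // Hypothetical stand-in for a positioned piece of text; not a PDFBox class.
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns a copy ordered top to bottom, then left to right
    // (assumes y grows downward; the input list is left untouched).
    static List<Chunk> sortByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return sorted;
    }

    // True when the text contains characters that need right-to-left handling.
    static boolean needsBidi(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }

    public static void main(String[] args) {
        // Chunks in storage order, which differs from display order.
        List<Chunk> stored = List.of(
                new Chunk("world", 50f, 10f),
                new Chunk("Hello", 10f, 10f),
                new Chunk("Title", 10f, 0f));

        StringBuilder line = new StringBuilder();
        for (Chunk c : sortByPosition(stored)) {
            line.append(c.text).append(' ');
        }
        System.out.println(line.toString().trim());

        // An Arabic word written with Unicode escapes to avoid encoding trouble.
        System.out.println(needsBidi("\u0645\u0631\u062D\u0628\u0627"));
        System.out.println(needsBidi("Hello"));
    }
}
```

Note that this only detects that right-to-left text is present; it does not reorder anything. Putting a right-to-left run into the correct reading order is the part that the ICU4J library mentioned above is needed for.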


Cheers,
Erik


Matthew Aguirre wrote:

Sorry if you get this twice, I accidentally sent this to the wrong list first.
I have been looking around and I saw where the issue with extracted Arabic words being written in reverse was fixed, but I'm seeing an issue where the extracted Arabic text of a sentence is in reverse. I assume this is due to Arabic being a right-to-left language. Is there any way to detect this and have PDFBox extract the text in the correct order?
Expected Arabic Text:
??????? ?????? ?????? ??????? ??????? ??????

Returned Arabic Text:
?????? ?????? ??????? ?????? ????? ???????

I am using the latest version (0.8.0-incubating).
Is there something else that I am missing?
