Matt,

I hope this is the information you need (from the README):

You get text that has the correct characters, but in the wrong order. This might be because you have not enabled sorting. The text in PDF files is stored in chunks and the chunks do not need to be stored in the order that they are displayed on a page. By default, PDFBox does not sort the text. Also, if you have text in a language that reads right to left (such as Arabic or Hebrew), make sure you have the ICU4J jar file in your classpath. This library is needed to properly handle right to left text.
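
To make the point about chunk order concrete, here is a minimal sketch in plain Java. It is not PDFBox code: the ChunkOrder and Chunk names, the coordinate convention, and the rule that one right-to-left chunk flips the whole list are all assumptions made for illustration. It uses only java.text.Bidi from the standard library to spot right-to-left text, then sorts chunks by line and by horizontal position. In PDFBox itself, as far as I know, position sorting is switched on with PDFTextStripper.setSortByPosition(true), but please check that against the version you are running.

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; none of these names come from PDFBox. */
final class ChunkOrder {

    /** Hypothetical text chunk: a word plus the position of its left edge. */
    static final class Chunk {
        final String text;
        final double x; // horizontal position, increasing to the right
        final double y; // vertical position, increasing down the page

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** True if the text contains characters that need right-to-left handling. */
    static boolean containsRightToLeft(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }

    /**
     * Sorts chunks top to bottom, then along each line, and joins them with spaces.
     * Assumptions: chunks on one line share exactly the same y value, and all
     * chunks share one direction, so a single right-to-left chunk makes every
     * line read from the right edge.
     */
    static String joinInReadingOrder(List<Chunk> chunks) {
        boolean rtl = false;
        for (Chunk c : chunks) {
            if (containsRightToLeft(c.text)) {
                rtl = true;
                break;
            }
        }
        Comparator<Chunk> byX = Comparator.comparingDouble((Chunk c) -> c.x);
        Comparator<Chunk> byLine = Comparator.comparingDouble((Chunk c) -> c.y);

        // Copy first so the caller's storage order is left untouched.
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(byLine.thenComparing(rtl ? byX.reversed() : byX));

        List<String> words = new ArrayList<>();
        for (Chunk c : sorted) {
            words.add(c.text);
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Stored out of display order, as a PDF is allowed to do.
        List<Chunk> stored = new ArrayList<>();
        stored.add(new Chunk("world", 50, 10));
        stored.add(new Chunk("Hello", 10, 10));
        System.out.println(joinInReadingOrder(stored)); // prints "Hello world"
    }
}
```

If the chunks were sorted left to right regardless of script, a right-to-left sentence would come out with its words in reverse, which looks a lot like the symptom you describe.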

Cheers,
Erik


Matthew Aguirre wrote:
Sorry if you get this twice; I accidentally sent it to the wrong list first.

I have been looking around and saw that the issue with extracted Arabic words being written in reverse was fixed, but I'm seeing an issue where the words of an extracted Arabic sentence come out in reverse order. I assume this is due to Arabic being a right-to-left language. Is there any way to detect this and have PDFBox extract the text in the correct order?

Expected Arabic Text:
??????? ?????? ?????? ??????? ??????? ??????

Returned Arabic Text:
?????? ?????? ??????? ?????? ????? ???????

I am using the latest version (0.8.0-incubating).
Is there something else that I am missing?