Re: Type 0 font - Text extraction X PDF Debugger

Tilman Hausherr Thu, 14 Mar 2024 03:15:59 -0700

Hi,

Please upload your file to a file-sharing service.


Tilman

On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:

Hi everyone,
I'm not sure if this is the same as the FAQ "How come I am getting gibberish (G38G43G36G51G5) when extracting text?"...
I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
I'm trying to understand how this PDF chunk (from p4_fix.pdf, attached)

  BT
  /G1F7 6.0 Tf
  94.871 773.806 Td
  <004200430044> Tj
  ET
is shown as "BCD" in PDFBox Debugger (and likewise in qpdfview, Adobe Reader, Chrome, ...) but comes out as "abc" from the PDFBox text extraction tool.
Poppler's pdftotext (version 22.02.0) gives me "BCD" too.

The viewers that let me copy the text also give me "BCD".
It seems that the PDFBox extraction tool follows section 9.10.2, "Mapping character codes to Unicode values" (ISO 32000-2:2020), but all the others choose a different way.
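
To make the two results concrete, here is a minimal sketch in plain Java (no PDFBox calls) of two ways the string <004200430044> could be turned into text. It assumes the font uses 2-byte character codes, which the 4-hex-digit groups suggest but the chunk alone does not prove. The class name, the method names and above all the code-to-"a"/"b"/"c" table are hypothetical; the real ToUnicode CMap of /G1F7 in p4_fix.pdf has not been inspected here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: none of this is PDFBox API.
final class Type0CodeSketch {

    /** Splits a hex string body such as "004200430044" into 2-byte codes (assumed code length). */
    static List<Integer> splitTwoByteCodes(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a multiple of 4 hex digits");
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    /** Reading A: take each code directly as a UTF-16 code unit. */
    static String asUtf16(List<Integer> codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append((char) code);
        }
        return sb.toString();
    }

    /** Reading B: look each code up in a ToUnicode-style table; unmapped codes become U+FFFD. */
    static String viaToUnicode(List<Integer> codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = splitTwoByteCodes("004200430044");

        // Hypothetical table, chosen only so that reading B reproduces "abc".
        Map<Integer, String> hypotheticalToUnicode = new LinkedHashMap<>();
        hypotheticalToUnicode.put(0x0042, "a");
        hypotheticalToUnicode.put(0x0043, "b");
        hypotheticalToUnicode.put(0x0044, "c");

        System.out.println(asUtf16(codes));                             // BCD
        System.out.println(viaToUnicode(codes, hypotheticalToUnicode)); // abc
    }
}
```

The codes 0x0042, 0x0043 and 0x0044 happen to be the UTF-16 values of "B", "C" and "D", so a tool that ignores or overrides the font's mapping could plausibly arrive at "BCD", while a tool that trusts a table like the hypothetical one above would arrive at "abc". Whether that is what actually happens with this file can only be decided by looking at the font dictionary itself.
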
Could you help me understand whether the problem is with the PDF file, with the viewers, or with the text extraction tool?
Thank you!



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



