Re: Type 0 font - Text extraction X PDF Debugger

Luiz Marcelo Modesto Thu, 14 Mar 2024 05:28:35 -0700

Hi Tilman!

    Thank you very much for your attention!


    You can find the file "p4_alt.pdf" in this folder
<https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
The "Extra infos.pdf" file in the same folder shows some output from PDF
Debugger and from other tools.

    I'm sorry: I sent the PDF file as an attachment in my first message,
not knowing that attachments wouldn't come through.



On Thu, Mar 14, 2024 at 07:16, Tilman Hausherr <thaush...@t-online.de>
wrote:

> Hi,
>
> please upload your file to a sharehoster.
>
> Tilman
>
> On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:
> > Hi everyone,
> >
> >     I'm not sure if this is the same as FAQ "How come I am getting
> > gibberish(G38G43G36G51G5) when extracting text?"...
> >
> >     I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
> > (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
> >
> >     I'm trying to understand how this PDF chunk (from p4_fix.pdf,
> > attached)
> >
> >   BT
> >   /G1F7 6.0 Tf
> >   94.871 773.806 Td
> >   <004200430044> Tj
> >   ET
> >
> >     is shown as "BCD" in PDFBox Debugger (and likewise in qpdfview,
> > Adobe Reader, Chrome, ...) but comes out as "abc" from the PDFBox
> > text extraction tool.
> >
> >     Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.
> >
> >     The renderers that allow me to copy the text also give me "BCD".
> >
> >     It seems that the PDFBox extraction tool follows section "9.10.2
> > Mapping character codes to Unicode values" (ISO 32000-2:2020), while
> > all the others take a different approach.
> >
> >     Could you help me understand whether the problem is in the PDF
> > file, in the renderers, or in the text extraction tool?
> >
> > Thank you!
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> > For additional commands, e-mail: users-h...@pdfbox.apache.org
>
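
To make the question in the quoted message concrete, here is a minimal Java
sketch (standard library only, no PDFBox calls) of two ways the string
<004200430044> could be turned into text. It assumes the font uses fixed
2-byte character codes, and the ToUnicode-style table in it is hypothetical:
the real mapping in p4_fix.pdf is not shown in this thread. The class and
method names are made up for illustration, and the sketch does not claim to
reproduce what PDFBox or any viewer actually does internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Type0CodeDemo {

    // Split a PDF hex string such as "<004200430044>" into character codes.
    // Assumption: every code is exactly 2 bytes wide (4 hex digits).
    static List<Integer> splitTwoByteCodes(String hexString) {
        String hex = hexString.replaceAll("[<>\\s]", "");
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a whole number of 2-byte codes");
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    // Reading A (hypothetical): treat each code as if it were a Unicode code point.
    static String decodeAsUnicode(List<Integer> codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    // Reading B: look each code up in a ToUnicode-style table, in the spirit of
    // ISO 32000-2 section 9.10.2. Unmapped codes become U+FFFD.
    static String decodeWithToUnicode(List<Integer> codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = splitTwoByteCodes("<004200430044>");

        // Hypothetical table, chosen only so that reading B yields "abc".
        Map<Integer, String> hypotheticalToUnicode =
                Map.of(0x0042, "a", 0x0043, "b", 0x0044, "c");

        System.out.println(decodeAsUnicode(codes));
        System.out.println(decodeWithToUnicode(codes, hypotheticalToUnicode));
    }
}
```

If the font's ToUnicode CMap really does map these codes to "a", "b", "c"
while the embedded glyphs are drawn as B, C, D, both outputs would be
explainable from the same file. Checking the ToUnicode stream of /G1F7 in
PDF Debugger would confirm or rule that out.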
