Re: Type 0 font - Text extraction X PDF Debugger

Luiz Marcelo Modesto Thu, 14 Mar 2024 06:31:22 -0700

Ok!

I'll read PDFBOX-5540 and related issues.


Thank you very much!


On Thu, 14 Mar 2024 at 10:08, Tilman Hausherr <thaush...@t-online.de>
wrote:

> Hi,
>
> The problem is in the ToUnicode stream: there's a log message "Invalid
> ToUnicode CMap in font AvenirNextLTPro-Cn". The CMap has no Unicode
> mappings, so PDFBox tries a fallback solution, which turns out to be
> wrong here. This is related to PDFBOX-5540 and earlier related issues.
>
> Tilman
>
>
>
> On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:
> > Hi Tilman!
> >
> >      Thank you very much for your attention!
> >
> >      You can find the file "p4_alt.pdf" in this folder:
> > <https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
> > The file "Extra infos.pdf" shows some output from PDF Debugger and
> > other tools.
> >
> >      I'm sorry, I sent the PDF file as an attachment in my first message;
> > I didn't know that attachments wouldn't work.
> >
> >
> >
> > On Thu, 14 Mar 2024 at 07:16, Tilman Hausherr <thaush...@t-online.de>
> > wrote:
> >
> >> Hi,
> >>
> >> Please upload your file to a file-sharing service.
> >>
> >> Tilman
> >>
> >> On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:
> >>> Hi everyone,
> >>>
> >>>      I'm not sure if this is the same as FAQ "How come I am getting
> >>> gibberish(G38G43G36G51G5) when extracting text?"...
> >>>
> >>>      I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
> >>> (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
> >>>
> >>>      I'm trying to understand how this PDF chunk (from the attached
> >>> p4_fix.pdf)
> >>>    BT
> >>>    /G1F7 6.0 Tf
> >>>    94.871 773.806 Td
> >>>    <004200430044> Tj
> >>>    ET
> >>>
> >>>      becomes "BCD" in PDFBox Debugger (and likewise in qpdfview, Adobe
> >>> Reader, Chrome, ...) but "abc" in the PDFBox text extraction tool.
> >>>
> >>>      Poppler's pdftotext (version 22.02.0) gives me "BCD" too.
> >>>
> >>>      The renderers that allow me to copy the text also give me "BCD".
> >>>
> >>>      It seems that the PDFBox extraction tool follows section "9.10.2
> >>> Mapping character codes to Unicode values" of ISO 32000-2:2020, while
> >>> all the others take a different approach.
> >>>
> >>>      Could you help me understand whether the problem is with the
> >>> PDF file, with the renderers, or with the text extraction tool?
> >>>
> >>> Thank you!
> >>>
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>> For additional commands, e-mail: users-h...@pdfbox.apache.org
> >>
> >>
> >>
> >>
>
>
>
>
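
For readers following the thread, here is a minimal sketch of how the
operand of the Tj operator in the quoted chunk, <004200430044>, splits into
character codes. It assumes the Type 0 font uses a two-byte encoding (for
example Identity-H), which the thread does not confirm. The class and method
names (HexStringCodes, twoByteCodes, asUtf16) are hypothetical, and reading
each code directly as a UTF-16 code unit is only an illustration of one way
to arrive at "BCD"; it is not PDFBox's algorithm, nor necessarily what the
other viewers do.

```java
import java.util.ArrayList;
import java.util.List;

final class HexStringCodes {

    // Split the hex digits of a PDF hex string (angle brackets removed)
    // into 2-byte character codes. ASSUMPTION: the font's encoding is 2-byte.
    static List<Integer> twoByteCodes(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a multiple of 4 hex digits");
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    // ASSUMPTION, for illustration only: treat each code as a UTF-16 code unit.
    // A valid ToUnicode CMap, when present, would define this mapping instead.
    static String asUtf16(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int code : twoByteCodes(hex)) {
            sb.append((char) code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hex = "004200430044"; // operand of Tj in the quoted chunk
        System.out.println(twoByteCodes(hex)); // [66, 67, 68]
        System.out.println(asUtf16(hex));      // BCD
    }
}
```

Under these assumptions the three codes are 0x0042, 0x0043 and 0x0044. Since
the ToUnicode CMap of this font has no Unicode mappings, any tool has to fall
back on some other source for the text, which is where the results can
differ.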
