Ok! I'll read PDFBOX-5540 and related issues.
Thank you very much!

On Thu, Mar 14, 2024 at 10:08, Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi,
>
> The problem is in the ToUnicode stream, there's a log message "Invalid
> ToUnicode CMap in font AvenirNextLTPro-Cn". It has no unicode mappings.
> PDFBox is trying a fallback solution which turns out to be wrong. This
> is related to PDFBOX-5540 and earlier related issues.
>
> Tilman
>
> On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:
> > Hi Tilman!
> >
> > Thank you very much for your attention!
> >
> > You can find the file "p4_alt.pdf" in this folder
> > <https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
> > "Extra infos.pdf" file shows some output from PDF Debugger and others.
> >
> > I'm sorry, I sent the pdf file as an attachment in my first message,
> > but I didn't know that it wouldn't work.
> >
> > On Thu, Mar 14, 2024 at 07:16, Tilman Hausherr <thaush...@t-online.de> wrote:
> >
> >> Hi,
> >>
> >> please upload your file to a sharehoster.
> >>
> >> Tilman
> >>
> >> On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:
> >>> Hi everyone,
> >>>
> >>> I'm not sure if this is the same as FAQ "How come I am getting
> >>> gibberish(G38G43G36G51G5) when extracting text?"...
> >>>
> >>> I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
> >>> (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
> >>>
> >>> I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)
> >>> BT
> >>> /G1F7 6.0 Tf
> >>> 94.871 773.806 Td
> >>> <004200430044> Tj
> >>> ET
> >>>
> >>> becomes "BCD" on PDFBox Debugger (the same on qpdfview, Adobe
> >>> Reader, Chrome, ...) and becomes "abc" on PDFBox text extraction tool.
> >>>
> >>> Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.
> >>>
> >>> The renders that allow me to copy the text give me "BCD" text.
> >>>
> >>> It seems that PDFBox extraction tool follows the item "9.10.2
> >>> Mapping character codes to Unicode values" (ISO 32000-2:2020) but all
> >>> the others choose a different way.
> >>>
> >>> Could you help me to understand if there is a problem with the
> >>> PDF file, with the renders or with the extract text tool?
> >>>
> >>> Thank you!
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>> For additional commands, e-mail: users-h...@pdfbox.apache.org
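
For anyone following the thread, here is a rough sketch of why the same Tj operand can come out as either "BCD" or "abc". It is not PDFBox code and does not reproduce PDFBox's actual fallback. It assumes the font uses two-byte character codes, so <004200430044> is read as the codes 0x0042, 0x0043 and 0x0044. The class name ToUnicodeSketch, the methods splitCodes and decode, and the sample map are all hypothetical and exist only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT how PDFBox is implemented.
final class ToUnicodeSketch {

    // Splits a hex string operand such as "004200430044" into character codes,
    // assuming every code is exactly two bytes (four hex digits) long.
    static int[] splitCodes(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a whole number of 2-byte codes");
        }
        int[] codes = new int[hex.length() / 4];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Integer.parseInt(hex.substring(4 * i, 4 * i + 4), 16);
        }
        return codes;
    }

    // Looks each code up in a ToUnicode-style map. When a code has no entry,
    // this sketch falls back to treating the code itself as a Unicode code
    // point. That fallback is an assumption made for the example; a real
    // extractor may pick a different one, which is where results diverge.
    static String decode(String hex, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int code : splitCodes(hex)) {
            String mapped = toUnicode.get(code);
            if (mapped != null) {
                text.append(mapped);
            } else {
                text.appendCodePoint(code);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // A CMap with no Unicode mappings: everything goes through the fallback.
        System.out.println(decode("004200430044", Map.of())); // prints "BCD"

        // A made-up mapping, chosen only to show that the extracted text
        // depends entirely on which mapping ends up being used.
        Map<Integer, String> madeUp = new LinkedHashMap<>();
        madeUp.put(0x0042, "a");
        madeUp.put(0x0043, "b");
        madeUp.put(0x0044, "c");
        System.out.println(decode("004200430044", madeUp)); // prints "abc"
    }
}
```

The point is only that the bytes in the content stream do not determine the text by themselves: with an empty ToUnicode CMap, whatever fallback the tool chooses decides the output.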