Hi Tilman! Thank you very much for your attention!
You can find the file "p4_alt.pdf" in this folder <https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>. The "Extra infos.pdf" file shows some output from PDF Debugger and other tools. I'm sorry, I sent the PDF file as an attachment in my first message because I didn't know that it wouldn't work.

On Thu, Mar 14, 2024 at 07:16, Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi,
>
> please upload your file to a sharehoster.
>
> Tilman
>
> On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:
> > Hi everyone,
> >
> > I'm not sure if this is the same as FAQ "How come I am getting
> > gibberish(G38G43G36G51G5) when extracting text?"...
> >
> > I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
> > (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
> >
> > I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)
> >
> > BT
> > /G1F7 6.0 Tf
> > 94.871 773.806 Td
> > <004200430044> Tj
> > ET
> >
> > becomes "BCD" in PDFBox Debugger (the same in qpdfview, Adobe
> > Reader, Chrome, ...) and becomes "abc" in the PDFBox text extraction tool.
> >
> > Poppler's pdftotext (version 22.02.0) gives me "BCD" too.
> >
> > The renderers that allow me to copy the text give me "BCD" as well.
> >
> > It seems that the PDFBox extraction tool follows item "9.10.2
> > Mapping character codes to Unicode values" (ISO 32000-2:2020), but all
> > the others choose a different way.
> >
> > Could you help me understand whether the problem is in the
> > PDF file, in the renderers, or in the text extraction tool?
> >
> > Thank you!
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> > For additional commands, e-mail: users-h...@pdfbox.apache.org
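
To make the question in the quoted message concrete, here is a minimal sketch of how one hex string can yield two different texts. It is only an illustration under stated assumptions: that the font uses 2-byte character codes, that its ToUnicode data maps those codes to "a", "b", "c", and that the glyphs drawn for the same codes look like "B", "C", "D". Both tables are hypothetical and were not read from p4_alt.pdf, and the function names are made up for this example.

```python
def split_codes(hex_string: str, code_bytes: int = 2) -> list[int]:
    """Split a PDF hex string such as '<004200430044>' into character codes."""
    digits = hex_string.strip().strip("<>")
    # A hex string with an odd number of digits is read as if a final 0 followed.
    if len(digits) % 2:
        digits += "0"
    raw = bytes.fromhex(digits)
    # Assumption: every character code is `code_bytes` bytes long, big-endian.
    return [
        int.from_bytes(raw[i:i + code_bytes], "big")
        for i in range(0, len(raw), code_bytes)
    ]


# Hypothetical ToUnicode-style table (assumed, not taken from the real file).
HYPOTHETICAL_TO_UNICODE = {0x0042: "a", 0x0043: "b", 0x0044: "c"}

# Hypothetical description of what the glyphs for the same codes look like.
HYPOTHETICAL_GLYPH_SHAPES = {0x0042: "B", 0x0043: "C", 0x0044: "D"}


def decode(codes: list[int], table: dict[int, str]) -> str:
    """Map each code through a table, using U+FFFD for unmapped codes."""
    return "".join(table.get(code, "\ufffd") for code in codes)


codes = split_codes("<004200430044>")
print([hex(code) for code in codes])
print("via code-to-Unicode table:", decode(codes, HYPOTHETICAL_TO_UNICODE))
print("via glyph appearance:     ", decode(codes, HYPOTHETICAL_GLYPH_SHAPES))
```

If the real file behaves like these assumed tables, a tool that trusts the code-to-Unicode mapping and a tool that relies on what the glyphs look like would disagree in exactly this way. Whether that is what happens in p4_alt.pdf is the open question.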