It seems that PDFBOX-5540 resolves a special case based on some dictionary properties and chooses a predefined CMap (Identity CMap).
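
As a minimal sketch of what an identity mapping means here (plain Java, not PDFBox code; the class and method names are hypothetical and only for illustration): each two-byte code in the string is used directly as the Unicode code point, so the <004200430044> string from the original message comes out as "BCD".

```java
import java.util.ArrayList;
import java.util.List;

public class IdentityMappingSketch {

    // Split a PDF hex string such as "004200430044" into two-byte character codes.
    static List<Integer> twoByteCodes(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a whole number of two-byte codes");
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    // Identity mapping (an assumption of this sketch): the character code
    // itself is taken as the Unicode code point.
    static String identityToUnicode(List<Integer> codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0x0042, 0x0043, 0x0044 are the code points of 'B', 'C', 'D'.
        System.out.println(identityToUnicode(twoByteCodes("004200430044")));
    }
}
```

This only illustrates the mapping rule; it says nothing about how PDFBox decides when the rule applies.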

Reading the PDFont.java code, I think the warning "Invalid ToUnicode CMap in font AvenirNextLTPro-Cn" appears because the CMap stream doesn't contain any beginbfchar/endbfchar blocks. The CMap's two HashMaps (charToUnicodeOneByte and charToUnicodeTwoBytes) really are empty.

But the font's CMap stream contains this block:

    2 begincidrange
    <0001> <00FF> 1
    <0100> <FFFF> 256
    endcidrange

I'm sorry if I misunderstood, but this is a valid CMap too, isn't it? It looks like a kind of Identity mapping as well (except for 0x00...). It is just shorter than what I would get by writing several beginbfchar/endbfchar blocks.

If I make this "dumb" modification (adding "true" to the conditions), just as a quick test:

    if (cmapName.contains("Identity") //
            || ordering.contains("Identity") //
            || COSName.IDENTITY_H.equals(encoding) //
            || COSName.IDENTITY_V.equals(encoding) || true)
    {
        COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
        if (true || encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
        {
            // assume that if encoding is identity, then the reverse is also true
            cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
            LOG.warn("Using predefined identity CMap instead");
        }
    }

I get the string "BCD", like all the other tools:

    The encoding parameter is ignored when writing to the console.
    mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
    ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
    mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
    ADVERTÊNCIA: Using predefined identity CMap instead
    Página 4 de 4
    Informações: BCD

Maybe the text extraction tool should use the begincidrange/endcidrange information. What do you think?

PS: I've read some parts of ISO 32000-2:2020, but it is quite long. Maybe I'm missing something. I'm sorry if that is the case.

On Thu, Mar 14, 2024 at 10:30, Luiz Marcelo Modesto <lmodesto.w...@gmail.com> wrote:

> Ok!
>
> I'll read PDFBOX-5540 and related issues.
>
> Thank you very much!
>
>
> On Thu, Mar 14, 2024, 10:08, Tilman Hausherr <thaush...@t-online.de>
> wrote:
>
>> Hi,
>>
>> The problem is in the ToUnicode stream, there's a log message "Invalid
>> ToUnicode CMap in font AvenirNextLTPro-Cn". It has no unicode mappings.
>> PDFBox is trying a fallback solution which turns out to be wrong. This
>> is related to PDFBOX-5540 and earlier related issues.
>>
>> Tilman
>>
>>
>>
>> On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:
>> > Hi Tilman!
>> >
>> > Thank you very much for your attention!
>> >
>> > You can find the file "p4_alt.pdf" in this folder
>> > <https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
>> > "Extra infos.pdf" file shows some output from PDF Debugger and others.
>> >
>> > I'm sorry, I sent the pdf file as an attachment in my first message,
>> > but I didn't know that it wouldn't work.
>> >
>> >
>> >
>> > On Thu, Mar 14, 2024 at 07:16, Tilman Hausherr <thaush...@t-online.de>
>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> please upload your file to a sharehoster.
>> >>
>> >> Tilman
>> >>
>> >> On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:
>> >>> Hi everyone,
>> >>>
>> >>> I'm not sure if this is the same as FAQ "How come I am getting
>> >>> gibberish(G38G43G36G51G5) when extracting text?"...
>> >>>
>> >>> I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
>> >>> (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
>> >>>
>> >>> I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)
>> >>> BT
>> >>> /G1F7 6.0 Tf
>> >>> 94.871 773.806 Td
>> >>> <004200430044> Tj
>> >>> ET
>> >>>
>> >>> becomes "BCD" on PDFBox Debugger (the same on qpdfview, Adobe
>> >>> Reader, Chrome, ...) and becomes "abc" on PDFBox text extraction tool.
>> >>>
>> >>> Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.
>> >>>
>> >>> The renderers that allow me to copy the text give me "BCD" text.
>> >>>
>> >>> It seems that PDFBox extraction tool follows the item "9.10.2
>> >>> Mapping character codes to Unicode values" (ISO 32000-2:2020) but all
>> >>> the others choose a different way.
>> >>>
>> >>> Could you help me to understand if there is a problem with the
>> >>> PDF file, with the renderers or with the extract text tool?
>> >>>
>> >>> Thank you!
>> >>>
>> >>>
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>> >>> For additional commands, e-mail: users-h...@pdfbox.apache.org