After reading a lot of the documentation again, I've changed my mind about what I wrote before.

1) "It's only shorter than the one I could have if I write several blocks of beginbfchar/endbfchar."

begincidrange/endcidrange is the short form of several begincidchar/endcidchar blocks. The correct short form of several beginbfchar/endbfchar blocks is beginbfrange/endbfrange.

2) "I'm sorry if I misunderstood, but this is a valid CMap too (it seems a kind of Identity mapping too, except for the 0x00...), isn't it?"

It could be a valid CMap, but not for text extraction. Item 9.10.3 is clear about when a CMap serves this purpose: "It shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding."

3) "If I've looked at the correct CMap file (fontbox/target/classes/org/apache/fontbox/cmap/Identity-H) it also has a lot of blocks of beginbfchar/endbfchar. It doesn't have any beginbfchar/endbfchar block."

The file has a lot of begincidrange/endcidrange blocks. In fact, it doesn't have any beginbfchar/endbfchar block (which conflicts with item 9.10.3...).

About debugging the text extraction tool:

1) Identity resolution uses this coding pattern in PDFont.java to obtain the Unicode value:

new String(new char[] { (char) code })

Something similar can be found in LegacyPDFStreamEngine.java.

My final thoughts:

1) Thank you Tilman for your help!

2) I think the tools that extract the "BCD" text could be partially ignoring the CMap (because it is invalid for text extraction: it contains neither beginbfchar/endbfchar nor beginbfrange/endbfrange). So maybe they don't try the five steps (letters "a" to "e") from item 9.10.2. Maybe they fall back to the "identity" transformation when no Unicode value can be produced...

"If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a PDF processor may choose a character code of their choosing."
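To make point 2 concrete, here is a minimal Java sketch of what such a tool might be doing. It is not PDFBox code and I have not checked how the other tools really work: the class name and both helper methods are hypothetical, and the check for the bf operators is a plain substring search, not a real CMap parser. It only reuses the identity pattern shown above on the string <004200430044> from my first message.

```java
class IdentityFallbackSketch {

    // Hypothetical helper: true if the CMap text uses at least one of the
    // operators that item 9.10.3 requires for a ToUnicode CMap.
    static boolean hasUnicodeOperators(String cmapText) {
        return cmapText.contains("beginbfchar") || cmapText.contains("beginbfrange");
    }

    // Hypothetical helper: treats every 2-byte code as its own UTF-16 code unit,
    // using the same pattern as the identity resolution quoted above.
    static String decodeIdentity(byte[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < codes.length; i += 2) {
            int code = ((codes[i] & 0xFF) << 8) | (codes[i + 1] & 0xFF);
            sb.append(new String(new char[] { (char) code }));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The only block found in the font's ToUnicode stream.
        String cmap = "2 begincidrange\n<0001> <00FF> 1\n<0100> <FFFF> 256\nendcidrange\n";
        // The bytes of <004200430044> from the Tj operator.
        byte[] shown = { 0x00, 0x42, 0x00, 0x43, 0x00, 0x44 };

        if (!hasUnicodeOperators(cmap)) {
            // No bfchar/bfrange entries: assume the tool falls back to identity.
            System.out.println(decodeIdentity(shown)); // prints BCD
        }
    }
}
```

As Tilman pointed out, a fallback like this gives the expected text for this file but can give wrong results for other files with an unusable /ToUnicode stream.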
3) I don't have any suggestions for a code change that could be a good solution. Maybe I'll have to extract text from some thousands of PDFs like "pag4_alt.pdf". In that case, I'll change the code with something like the file "identityForBadToUnicodeCMap.patch" that I've dropped into the shared folder.

On Fri, Mar 15, 2024 at 10:54, Tilman Hausherr <thaush...@t-online.de> wrote:

> Yes identity does work for that file. However using that logic fails to
> provide the correct results for other files with an unusable /ToUnicode
> stream.
>
> Yes there can be larger blocks.
>
> My suspicion is that the tools that use "identity" for your file will
> fail for some of the files. Unless we discover yet another tweak of that
> workaround algorithm that works with all.
>
> Tilman
>
> On 15.03.2024 14:28, Luiz Marcelo Modesto wrote:
> > Thank you Tilman!
> >
> > I'll try to read ISO 32000-2:2020 again to look for some kind of
> > precedence rules regarding the way of decoding string codes to
> > Unicode chars.
> >
> > My impression is that there are some choices but I don't remember if
> > there is something assertive or not. Maybe it could be just an
> > implementation choice.
> >
> > I'll try to debug the text extraction tool to verify why using the
> > predefined Identity CMap works.
> >
> > If I've looked at the correct CMap file
> > (fontbox/target/classes/org/apache/fontbox/cmap/Identity-H) it also has a
> > lot of blocks of beginbfchar/endbfchar. It doesn't have any
> > beginbfchar/endbfchar block.
> >
> > All the blocks have their length limited to 256 codes, but it seems
> > PDFBox can support larger blocks. But, maybe the set
> > "<0100> <FFFF> 256" could be a problem...
> >
> > PS.: The use of "true" was just a quick and dirty way to do a fast
> > test, as the beginbfchar/endbfchar block suggested to me an identity
> > mapping.
> >
> > On Fri, Mar 15, 2024 at 01:35, Tilman Hausherr <thaush...@t-online.de>
> > wrote:
> >
> >> You are correct that it's the "fb" parts that are missing. (And some of
> >> the other tools you tried also mention this)
> >>
> >> Just adding true results in text extraction of several files no longer
> >> being correct, 433525-p1.pdf O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf
> >> PDFBOX-5540.pdf R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf
> >>
> >> Adding "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" brings
> >> no regressions but your text is not extracted properly.
> >>
> >> Maybe it is possible to include yet another rule for your file, but
> >> there's likely more to do and there is the risk that other files no
> >> longer extract properly.
> >>
> >> Tilman
> >>
> >> On 15.03.2024 00:08, Luiz Marcelo Modesto wrote:
> >>> It seems that PDFBOX-5540 resolves a special case based on some
> >>> dictionary properties and chooses a predefined CMap (Identity CMap).
> >>>
> >>> Reading the PDFont.java code, I think the warning "Invalid ToUnicode
> >>> CMap in font AvenirNextLTPro-Cn" comes from the fact that the CMap
> >>> stream doesn't contain 1 or more blocks of beginbfchar/endbfchar.
> >>>
> >>> The two CMap HashMaps (charToUnicodeOneByte and
> >>> charToUnicodeTwoBytes) are really empty.
> >>>
> >>> But the font CMap stream contains this block:
> >>>
> >>> 2 begincidrange
> >>> <0001> <00FF> 1
> >>> <0100> <FFFF> 256
> >>> endcidrange
> >>>
> >>> I'm sorry if I misunderstood, but this is a valid CMap too (it seems
> >>> a kind of Identity mapping too, except for the 0x00...), isn't it?
> >>>
> >>> It's only shorter than the one I could have if I write several blocks
> >>> of beginbfchar/endbfchar.
> >>>
> >>> If I make this "dumb" modification (adding "true" to the conditions)
> >>> just for a rapid test
> >>>
> >>> if (cmapName.contains("Identity") //
> >>>         || ordering.contains("Identity") //
> >>>         || COSName.IDENTITY_H.equals(encoding) //
> >>>         || COSName.IDENTITY_V.equals(encoding) || true)
> >>> {
> >>>     COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
> >>>     if (true || encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
> >>>     {
> >>>         // assume that if encoding is identity, then the reverse is also true
> >>>         cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
> >>>         LOG.warn("Using predefined identity CMap instead");
> >>>     }
> >>> }
> >>>
> >>> I get the "BCD" string like all the others:
> >>>
> >>> The encoding parameter is ignored when writing to the console.
> >>> mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> >>> ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
> >>> mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> >>> ADVERTÊNCIA: Using predefined identity CMap instead
> >>> Página 4 de 4
> >>> Informações: BCD
> >>>
> >>> Maybe the text extraction tool should be using the
> >>> begincidrange/endcidrange information...
> >>>
> >>> What do you think?
> >>>
> >>> PS.: I've read some pieces of ISO 32000-2:2020 but it is quite long.
> >>> Maybe I'm missing something... I'm sorry if this is the case...
> >>>
> >>> On Thu, Mar 14, 2024 at 10:30, Luiz Marcelo Modesto
> >>> <lmodesto.w...@gmail.com> wrote:
> >>>
> >>>> Ok!
> >>>>
> >>>> I'll read PDFBOX-5540 and related issues.
> >>>>
> >>>> Thank you very much!
> >>>>
> >>>> On Thu, Mar 14, 2024 at 10:08, Tilman Hausherr
> >>>> <thaush...@t-online.de> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> The problem is in the ToUnicode stream, there's a log message "Invalid
> >>>>> ToUnicode CMap in font AvenirNextLTPro-Cn".
> >>>>> It has no unicode mappings.
> >>>>> PDFBox is trying a fallback solution which turns out to be wrong. This
> >>>>> is related to PDFBOX-5540 and earlier related issues.
> >>>>>
> >>>>> Tilman
> >>>>>
> >>>>> On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:
> >>>>>> Hi Tilman!
> >>>>>>
> >>>>>> Thank you very much for your attention!
> >>>>>>
> >>>>>> You can find the file "p4_alt.pdf" in this folder
> >>>>>> <https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
> >>>>>> The "Extra infos.pdf" file shows some output from PDF Debugger and
> >>>>>> others.
> >>>>>>
> >>>>>> I'm sorry, I sent the pdf file as an attachment in my first
> >>>>>> message, but I didn't know that it wouldn't work.
> >>>>>>
> >>>>>> On Thu, Mar 14, 2024 at 07:16, Tilman Hausherr
> >>>>>> <thaush...@t-online.de> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> please upload your file to a sharehoster.
> >>>>>>>
> >>>>>>> Tilman
> >>>>>>>
> >>>>>>> On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> I'm not sure if this is the same as the FAQ "How come I am
> >>>>>>>> getting gibberish(G38G43G36G51G5) when extracting text?"...
> >>>>>>>>
> >>>>>>>> I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
> >>>>>>>> (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
> >>>>>>>>
> >>>>>>>> I'm trying to understand how this PDF chunk (from p4_fix.pdf,
> >>>>>>>> attached)
> >>>>>>>>
> >>>>>>>> BT
> >>>>>>>> /G1F7 6.0 Tf
> >>>>>>>> 94.871 773.806 Td
> >>>>>>>> <004200430044> Tj
> >>>>>>>> ET
> >>>>>>>>
> >>>>>>>> becomes "BCD" in PDFBox Debugger (the same in qpdfview, Adobe
> >>>>>>>> Reader, Chrome, ...) and becomes "abc" in the PDFBox text
> >>>>>>>> extraction tool.
> >>>>>>>> Using Poppler pdftotext (version 22.02.0) gives me "BCD" too.
> >>>>>>>> The renderers that allow me to copy the text give me "BCD" text.
> >>>>>>>>
> >>>>>>>> It seems that the PDFBox extraction tool follows item "9.10.2
> >>>>>>>> Mapping character codes to Unicode values" (ISO 32000-2:2020) but
> >>>>>>>> all the others choose a different way.
> >>>>>>>>
> >>>>>>>> Could you help me to understand if there is a problem with the
> >>>>>>>> PDF file, with the renderers or with the text extraction tool?
> >>>>>>>>
> >>>>>>>> Thank you!
> >>>>>>>>
> >>>>>>>> ---------------------------------------------------------------------
> >>>>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org