Re: Type 0 font - Text extraction X PDF Debugger

On 25.03.24 at 10:07, Tilman Hausherr wrote:
On 25.03.2024 07:48, Andreas Lehmkühler wrote:

Thanks for the URLs. All of them are working with my change. See
https://issues.apache.org/jira/browse/PDFBOX-5790 for further details.

@Tilman Please run your tests if possible

No regressions 👍

Cool, thanks for the retest

Tilman

Andreas

On 24.03.24 at 16:39, Tilman Hausherr wrote:

Here they are, remove the XXX

https://corpora.tika.apache.org/XXXbase/docs/govdocs1/433/433525.pdf
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/O2/O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/R4/R4EXG25W532JHDQLJAM4HF6O532TLR7D

The extension p1 / p3 means I split these files and used only one page for my own tests.

Tilman

On 24.03.2024 16:19, Andreas Lehmkühler wrote:
On 15.03.24 at 05:35, Tilman Hausherr wrote:

You are correct that it's the "fb" parts that are missing. (And some of the other tools you tried also mention this.)

Just adding true results in text extraction of several files no longer being correct:

    433525-p1.pdf
    O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf
    PDFBOX-5540.pdf
    R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf

I've found a solution which works with the provided PDF and with PDFBOX-5540.pdf.

@Tilman I guess the other files are from our test corpus? If so, where exactly can I find them?

Andreas

Adding "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" brings no regressions, but your text is not extracted properly. Maybe it is possible to include yet another rule for your file, but there's likely more to do, and there is the risk that other files no longer extract properly.

Tilman

On 15.03.2024 00:08, Luiz Marcelo Modesto wrote:

It seems that PDFBOX-5540 resolves a special case based on some dictionary properties and chooses a predefined CMap (Identity CMap).

Reading the PDFont.java code, I think the warning "Invalid ToUnicode CMap in font AvenirNextLTPro-Cn" comes from the fact that the CMap stream doesn't contain one or more beginbfchar/endbfchar blocks. The CMap's two HashMaps (charToUnicodeOneByte and charToUnicodeTwoBytes) really are empty. But the font's CMap stream contains this block:

    2 begincidrange
    <0001> <00FF> 1
    <0100> 256
    endcidrange

I'm sorry if I misunderstood, but this is a valid CMap too (it seems to be a kind of Identity mapping too, except for the 0x00...), isn't it? It's only shorter than the one I would get if I wrote several beginbfchar/endbfchar blocks.

If I make this "dumb" modification (adding "true" to the conditions) just for a quick test

    if (cmapName.contains("Identity") //
            || ordering.contains("Identity") //
            || COSName.IDENTITY_H.equals(encoding) //
            || COSName.IDENTITY_V.equals(encoding) || true)
    {
        COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
        if (true || encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
        {
            // assume that if encoding is identity, then the reverse is also true
            cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
            LOG.warn("Using predefined identity CMap instead");
        }
    }

I get the "BCD" string like all the others:

    The encoding parameter is ignored when writing to the console.
    Mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
    WARNING: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
    Mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
    WARNING: Using predefined identity CMap instead
    Página 4 de 4
    Informações: BCD

Maybe the text extraction tool should be using the begincidrange/endcidrange information... What do you think?

PS: I've read some parts of ISO 32000-2:2020, but it is quite long. Maybe I'm missing something... I'm sorry if this is the case...

On Thu, 14 Mar 2024 at 10:30, Luiz Marcelo Modesto <lmodesto.w...@gmail.com> wrote:

Ok! I'll read PDFBOX-5540 and related issues.

Thank you very much!

On Thu, 14 Mar 2024 10:08, Tilman Hausherr wrote:

Hi,

The problem is in the ToUnicode stream; there's a log message "Invalid ToUnicode CMap in font AvenirNextLTPro-Cn". It has no Unicode mappings. PDFBox is trying a fallback solution which turns out to be wrong. This is related to PDFBOX-5540 and earlier related issues.

Tilman

On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:

Hi Tilman!

Thank you very much for your attention!

You can find the file "p4_alt.pdf" in this folder <https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>. The "Extra infos.pdf" file shows some output from PDF Debugger and others.

I'm sorry, I sent the PDF file as an attachment in my first message, but I didn't know that it wouldn't work.

On Thu, 14 Mar 2024 at 07:16, Tilman Hausherr <thaush...@t-online.de> wrote:

Hi,

please upload your file to a sharehoster.

Tilman

On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:

Hi everyone,

I'm not sure if this is the same as FAQ "How come I am getting gibberish(G38G43G36G
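
To make the begincidrange point from Luiz's message concrete, here is a minimal standalone sketch in plain Java. It is not PDFBox code, and the class and method names (CidRangeSketch, parse, lookup) are made up for illustration. It reads "<start> <end> cid" entries from the block quoted in the thread and shows that the first range sends each code to the CID with the same number, which is why it looks like an identity mapping. Note that a cidrange maps character codes to CIDs, not to Unicode values, which would be consistent with the two charToUnicode maps staying empty. The second entry as quoted ("<0100> 256") has no end code, so this sketch simply skips it rather than guessing what was meant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
public class CidRangeSketch {

    // Matches one complete range entry: <startHex> <endHex> decimalCid
    private static final Pattern ENTRY =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>\\s*(\\d+)");

    // Returns each complete entry as {startCode, endCode, firstCid}.
    static List<int[]> parse(String cmapText) {
        List<int[]> ranges = new ArrayList<>();
        int begin = cmapText.indexOf("begincidrange");
        int end = cmapText.indexOf("endcidrange");
        if (begin < 0 || end < begin) {
            return ranges; // no cidrange block found
        }
        String body = cmapText.substring(begin + "begincidrange".length(), end);
        Matcher m = ENTRY.matcher(body);
        while (m.find()) {
            ranges.add(new int[] {
                Integer.parseInt(m.group(1), 16),
                Integer.parseInt(m.group(2), 16),
                Integer.parseInt(m.group(3))
            });
        }
        return ranges;
    }

    // Maps a character code to its CID, or -1 if no range covers it.
    static int lookup(List<int[]> ranges, int code) {
        for (int[] r : ranges) {
            if (code >= r[0] && code <= r[1]) {
                return r[2] + (code - r[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Block as quoted in the thread; the incomplete second entry is ignored.
        String block = "2 begincidrange <0001> <00FF> 1 <0100> 256 endcidrange";
        List<int[]> ranges = parse(block);
        int code = 0x42;
        // prints "code 0x0042 -> CID 66"
        System.out.println(String.format("code 0x%04X -> CID %d", code, lookup(ranges, code)));
    }
}
```

Whether treating such a range as an identity ToUnicode mapping is safe in general is exactly what the regression files above are about, so this only illustrates the data, not a fix.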