Re: Type 0 font - Text extraction X PDF Debugger

Tilman Hausherr Mon, 25 Mar 2024 02:08:59 -0700

On 25.03.2024 07:48, Andreas Lehmkühler wrote:

Thanks for the URLs. All of them are working with my change.
See https://issues.apache.org/jira/browse/PDFBOX-5790 for further details.
@Tilman Please run your tests if possible.


No regressions 👍

Tilman

Andreas

On 24.03.24 at 16:39, Tilman Hausherr wrote:
Here they are, remove the XXX

https://corpora.tika.apache.org/XXXbase/docs/govdocs1/433/433525.pdf
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/O2/O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/R4/R4EXG25W532JHDQLJAM4HF6O532TLR7D
The extension p1 / p3 means I split these files and used only one page for my own tests.
Tilman


On 24.03.2024 16:19, Andreas Lehmkühler wrote:
On 15.03.24 at 05:35, Tilman Hausherr wrote:
You are correct that it's the "fb" parts that are missing. (And some of the other tools you tried also mention this.)
Just adding true results in text extraction of several files no longer being correct: 433525-p1.pdf, O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf, PDFBOX-5540.pdf, R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf
I've found a solution which works with the provided PDF and with PDFBOX-5540.pdf.
@Tilman I guess the other files are from our test corpus? If so, where exactly can I find them?
Andreas
Adding "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" brings no regressions, but your text is not extracted properly.
Maybe it is possible to include yet another rule for your file, but there's likely more to do and there is the risk that other files no longer extract properly.
Tilman

On 15.03.2024 00:08, Luiz Marcelo Modesto wrote:
It seems that PDFBOX-5540 resolves a special case based on some dictionary
properties and chooses a predefined CMap (Identity CMap).
Reading the PDFont.java code, I think the warning "Invalid ToUnicode CMap
in font AvenirNextLTPro-Cn" comes from the fact that the CMap stream
doesn't contain one or more beginbfchar/endbfchar blocks.
The CMap's two HashMaps (charToUnicodeOneByte and charToUnicodeTwoBytes)
are really empty.

But the font CMap stream contains this block:

2 begincidrange
<0001> <00FF> 1
<0100> <FFFF> 256
endcidrange
I'm sorry if I misunderstood, but this is a valid CMap too (it seems to be a kind
of Identity mapping as well, except for 0x00...), isn't it?
It's only shorter than the one I would get if I wrote several blocks of
beginbfchar/endbfchar.
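
To make the arithmetic of that cidrange block concrete, here is a minimal sketch in plain Java. It is not PDFBox's CMap parser: the class and method names (CidRangeSketch, toCid, asText) are made up for illustration, and reading the resulting CID as a Unicode code point is an assumption that mirrors an identity fallback, not something the file itself states.

```java
// Sketch only: models the two cidrange lines quoted above, not PDFBox's CMap parser.
class CidRangeSketch {
    // Each row is {firstCode, lastCode, firstCid}, copied from the begincidrange block.
    static final int[][] RANGES = {
        {0x0001, 0x00FF, 1},
        {0x0100, 0xFFFF, 256},
    };

    // Returns the CID for a two-byte code, or -1 if no range covers it.
    static int toCid(int code) {
        for (int[] r : RANGES) {
            if (code >= r[0] && code <= r[1]) {
                return r[2] + (code - r[0]);
            }
        }
        return -1;
    }

    // Assumption: treat each CID as a Unicode code point, as an identity fallback would.
    static String asText(int... codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            int cid = toCid(code);
            if (cid >= 0) {
                sb.appendCodePoint(cid);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(asText(0x0042, 0x0043, 0x0044)); // prints "BCD"
        System.out.println(toCid(0x0000));                  // prints "-1": code 0 is not covered
    }
}
```

With these two ranges every code from 0x0001 to 0xFFFF maps to a CID equal to itself, which is why the block behaves like an identity mapping for everything except 0x0000.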
If I make this "dumb" modification (adding "true" to the conditions) just for a
quick test

if (cmapName.contains("Identity") //
        || ordering.contains("Identity") //
        || COSName.IDENTITY_H.equals(encoding) //
        || COSName.IDENTITY_V.equals(encoding) || true)
{
    COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
    if (true || encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
    {
        // assume that if encoding is identity, then the reverse is also true
        cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
        LOG.warn("Using predefined identity CMap instead");
    }
}

I get the string "BCD", like all the other tools.

The encoding parameter is ignored when writing to the console.
mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
ADVERTÊNCIA: Using predefined identity CMap instead
Página 4 de 4
Informações:  BCD
Maybe the text extraction tool should be using the begincidrange/endcidrange
information...

What do you think?
PS.: I've read some parts of ISO 32000-2:2020, but it is quite long.
Maybe I'm missing something... I'm sorry if this is the case...

On Thu, 14 Mar 2024 at 10:30, Luiz Marcelo Modesto <
lmodesto.w...@gmail.com> wrote:
Ok!

I'll read PDFBOX-5540 and related issues.

Thank you very much!
On Thu, 14 Mar 2024 at 10:08, Tilman Hausherr <thaush...@t-online.de>
wrote:
Hi,
The problem is in the ToUnicode stream; there's a log message "Invalid ToUnicode CMap in font AvenirNextLTPro-Cn". It has no Unicode mappings. PDFBox is trying a fallback solution which turns out to be wrong. This
is related to PDFBOX-5540 and earlier related issues.

Tilman



On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:
Hi Tilman!

Thank you very much for your attention!

You can find the file "p4_alt.pdf" in this folder
<https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
The "Extra infos.pdf" file shows some output from PDF Debugger and others.
I'm sorry, I sent the PDF file as an attachment in my first message,
but I didn't know that it wouldn't work.



On Thu, 14 Mar 2024 at 07:16, Tilman Hausherr <
thaush...@t-online.de>
wrote:
Hi,

please upload your file to a sharehoster.

Tilman

On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:
Hi everyone,
I'm not sure if this is the same as the FAQ "How come I am getting
gibberish (G38G43G36G51G5) when extracting text?"...
I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
(build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
I'm trying to understand how this PDF chunk (from p4_fix.pdf,
attached)
    BT
    /G1F7 6.0 Tf
    94.871 773.806 Td
    <004200430044> Tj
    ET
becomes "BCD" in PDFBox Debugger (the same in qpdfview, Adobe Reader, Chrome, ...) and becomes "abc" in the PDFBox text extraction tool.
Using Poppler pdftotext (version 22.02.0) gives me "BCD" too.
The renderers that allow me to copy the text also give me "BCD".
It seems that the PDFBox extraction tool follows item "9.10.2 Mapping character codes to Unicode values" (ISO 32000-2:2020), but all
the others choose a different way.
Could you help me understand whether there is a problem with the
PDF file, with the renderers or with the text extraction tool?
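
For reference, the hex string in the Tj operator above can be split into codes by hand. The sketch below is plain Java, not PDFBox code. It assumes the font uses two-byte character codes and that each code is read directly as a Unicode code point (an identity mapping). Both are assumptions made for illustration, and the names HexStringSketch, twoByteCodes and identityText are hypothetical.

```java
// Sketch only: splits a PDF hex string into two-byte codes under an assumed identity mapping.
class HexStringSketch {
    // Splits the hex digits between < and > into 16-bit character codes.
    static int[] twoByteCodes(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a whole number of two-byte codes");
        }
        int[] codes = new int[hex.length() / 4];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Integer.parseInt(hex.substring(4 * i, 4 * i + 4), 16);
        }
        return codes;
    }

    // Assumption: each code is used directly as a Unicode code point.
    static String identityText(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = twoByteCodes("004200430044");
        System.out.println(identityText(codes)); // prints "BCD"
    }
}
```

Under those assumptions the codes are 0x0042, 0x0043 and 0x0044, which matches the "BCD" shown by the viewers.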

Thank you!
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org