Re: Type 0 font - Text extraction X PDF Debugger

Luiz Marcelo Modesto Thu, 14 Mar 2024 16:09:21 -0700

It seems that PDFBOX-5540 resolves a special case based on some dictionary
properties and chooses a predefined CMap (Identity CMap).


Reading the PDFont.java code, I think the warning "Invalid ToUnicode CMap
in font AvenirNextLTPro-Cn" is logged because the CMap stream doesn't
contain any beginbfchar/endbfchar blocks.

The CMap's two HashMaps (charToUnicodeOneByte and charToUnicodeTwoBytes)
are indeed empty.

But the font CMap stream contains this block:

2 begincidrange
<0001> <00FF> 1
<0100> <FFFF> 256
endcidrange

I'm sorry if I've misunderstood, but isn't this a valid CMap too? It looks
like a kind of Identity mapping as well (except for the 0x00...).

It's just shorter than what I would get by writing out several
beginbfchar/endbfchar blocks.
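
To make the arithmetic of that block concrete, here is a minimal sketch of how its two ranges map a character code to a CID. It is only an illustration, not PDFBox code: the class name CidRangeSketch and the method toCid are made up for this example. Note that a cidrange entry yields a CID, not a Unicode value.

```java
import java.util.OptionalInt;

// Hypothetical illustration of the two cidrange entries quoted above.
// This is NOT how PDFBox parses CMaps; the names are invented for the example.
final class CidRangeSketch {
    // Each row: {first code, last code, CID assigned to the first code}
    private static final int[][] RANGES = {
        {0x0001, 0x00FF, 1},
        {0x0100, 0xFFFF, 256},
    };

    // Returns the CID for a code, or empty if no range covers it.
    static OptionalInt toCid(int code) {
        for (int[] r : RANGES) {
            if (code >= r[0] && code <= r[1]) {
                // CID = first CID of the range + offset of the code within the range
                return OptionalInt.of(r[2] + (code - r[0]));
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        int[] codes = {0x0000, 0x0042, 0x0043, 0x0044, 0x0100};
        for (int code : codes) {
            OptionalInt cid = toCid(code);
            System.out.printf("code <%04X> -> %s%n", code,
                    cid.isPresent() ? "CID " + cid.getAsInt() : "unmapped");
        }
    }
}
```

With these two ranges every code from 0x0001 to 0xFFFF maps to a CID of the same numeric value, and only code 0x0000 is left unmapped.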

If I make this "dumb" modification (adding "true" to the conditions) just as a
quick test:

if (cmapName.contains("Identity") //
        || ordering.contains("Identity") //
        || COSName.IDENTITY_H.equals(encoding) //
        || COSName.IDENTITY_V.equals(encoding) || true)
{
    COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
    if (true || encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
    {
        // assume that if encoding is identity, then the reverse is also true
        cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
        LOG.warn("Using predefined identity CMap instead");
    }
}

I get the "BCD" string, like all the other tools:

The encoding parameter is ignored when writing to the console.
mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
ADVERTÊNCIA: Using predefined identity CMap instead
Página 4 de 4
Informações:  BCD
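
For reference, the "BCD" result is what a plain identity mapping gives for the string operand <004200430044> from my first message. The sketch below is a hypothetical illustration (the class IdentityDecodeSketch and its decode method are invented here, they are not part of PDFBox): it splits the hex string into two-byte codes and, as an assumption, treats each code as the Unicode code point with the same value.

```java
// Hypothetical illustration; not PDFBox code.
final class IdentityDecodeSketch {
    // Decodes the contents of a PDF hex string (without the angle brackets)
    // as a sequence of two-byte codes. ASSUMPTION: each code is read as the
    // Unicode code point with the same numeric value (an identity mapping).
    static String decode(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a whole number of two-byte codes");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < hex.length(); i += 4) {
            int code = Integer.parseInt(hex.substring(i, i + 4), 16);
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The operand of the Tj operator in the content stream.
        System.out.println(decode("004200430044")); // prints BCD
    }
}
```

The codes 0x0042, 0x0043 and 0x0044 are the code points of "B", "C" and "D", which is why an identity fallback produces the same text as the viewers.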

Maybe the text extraction tool should be using the begincidrange/endcidrange
information...

What do you think?

PS: I've read parts of ISO 32000-2:2020, but it is quite long.
Maybe I'm missing something. I'm sorry if that is the case.

On Thu, Mar 14, 2024 at 10:30, Luiz Marcelo Modesto <
lmodesto.w...@gmail.com> wrote:

> Ok!
>
> I'll read PDFBOX-5540 and related issues.
>
> Thank you very much!
>
>
> On Thu, Mar 14, 2024 at 10:08, Tilman Hausherr <thaush...@t-online.de>
> wrote:
>
>> Hi,
>>
>> The problem is in the ToUnicode stream, there's a log message "Invalid
>> ToUnicode CMap in font AvenirNextLTPro-Cn". It has no unicode mappings.
>> PDFBox is trying a fallback solution which turns out to be wrong. This
>> is related to PDFBOX-5540 and earlier related issues.
>>
>> Tilman
>>
>>
>>
>> On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:
>> > Hi Tilman!
>> >
>> >      Thank you very much for your attention!
>> >
>> >      You can find the file "p4_alt.pdf" in this folder:
>> > <https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
>> > "Extra infos.pdf" file shows some output from PDF Debugger and others.
>> >
>> >      I'm sorry, I sent the PDF file as an attachment in my first
>> > message, but I didn't know that it wouldn't work.
>> >
>> >
>> >
>> > On Thu, Mar 14, 2024 at 07:16, Tilman Hausherr <thaush...@t-online.de>
>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> please upload your file to a sharehoster.
>> >>
>> >> Tilman
>> >>
>> >> On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:
>> >>> Hi everyone,
>> >>>
>> >>>      I'm not sure if this is the same as FAQ "How come I am getting
>> >>> gibberish(G38G43G36G51G5) when extracting text?"...
>> >>>
>> >>>      I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
>> >>> (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
>> >>>
>> >>>      I'm trying to understand how this PDF chunk (from the attached
>> >>> p4_fix.pdf)
>> >>>    BT
>> >>>    /G1F7 6.0 Tf
>> >>>    94.871 773.806 Td
>> >>>    <004200430044> Tj
>> >>>    ET
>> >>>
>> >>>      becomes "BCD" on PDFBox Debugger (the same on qpdfview, Adobe
>> >>> Reader, Chrome, ...) and becomes "abc" on PDFBox text extraction tool.
>> >>>
>> >>>      Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.
>> >>>
>> >>>      The renderers that allow me to copy the text give me "BCD" too.
>> >>>
>> >>>      It seems that PDFBox extraction tool follows the item "9.10.2
>> >>> Mapping character codes to Unicode values" (ISO 32000-2:2020) but all
>> >>> the others choose a different way.
>> >>>
>> >>>       Could you help me understand whether the problem is in the
>> >>> PDF file, in the renderers, or in the text extraction tool?
>> >>>
>> >>> Thank you!
>> >>>
>> >>>
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>> >>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>> >>
>> >>
