Re: Type 0 font - Text extraction X PDF Debugger

Luiz Marcelo Modesto Sat, 16 Mar 2024 13:20:22 -0700

After reading a lot of documentation again, I've changed my mind about what
I wrote before.


1)  "It's only shorter than the one I could have if I write several blocks
of beginbfchar/endbfchar."

begincidrange/endcidrange is a short form for several
begincidchar/endcidchar blocks.

beginbfrange/endbfrange is the correct short form for several
beginbfchar/endbfchar blocks.
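
To make the relation between the two forms concrete, here is a minimal Java sketch (not PDFBox code; the class and method names are hypothetical). It expands one beginbfrange entry of the form <lo> <hi> <dst> into the individual pairs that separate beginbfchar entries would give. It assumes the destination is a single UTF-16 code unit that is incremented together with the source code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of PDFBox.
final class BfRangeSketch {

    // Expands "<lo> <hi> <dst>" into code -> Unicode string pairs.
    // Assumption: dst is one UTF-16 code unit and does not overflow.
    static Map<Integer, String> expand(int lo, int hi, int dst) {
        Map<Integer, String> result = new LinkedHashMap<>();
        for (int code = lo; code <= hi; code++) {
            result.put(code, String.valueOf((char) (dst + (code - lo))));
        }
        return result;
    }

    public static void main(String[] args) {
        // The range <0042> <0044> <0042> stands for three bfchar entries:
        // <0042> <0042>, <0043> <0043> and <0044> <0044>.
        expand(0x0042, 0x0044, 0x0042).forEach((code, text) ->
                System.out.printf("<%04X> -> %s%n", code, text));
    }
}
```

The same loop with a CID as destination instead of a Unicode string would describe begincidrange, which is why the two ranges look alike but serve different purposes.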

2) "I'm sorry if I misunderstood, but this is a valid CMap too (it seems a
kind of Identity mapping too, except for the 0x00...), isn't it?"

It could be a valid CMap, but not for the purpose of text extraction.

Item 9.10.3 is clear about when a CMap serves this purpose:

"It shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange
operators to define the mapping from character codes to Unicode character
sequences expressed in UTF-16BE encoding."
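
As a small illustration of "Unicode character sequences expressed in UTF-16BE", the following Java sketch decodes the destination side of a bfchar/bfrange entry from its hex form. The class and method names are hypothetical, and the hex values are only examples, not taken from the file under discussion.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
final class Utf16BeSketch {

    // Decodes a hex string such as "0042" (the destination of a bfchar entry)
    // as UTF-16BE. Assumes an even number of valid hex digits.
    static String decodeDestination(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(decodeDestination("0042"));     // one code unit
        System.out.println(decodeDestination("00660069")); // a sequence of two code units
    }
}
```

A destination may hold more than one code unit, so a single character code can map to a short string.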

3) "If I've looked at the correct CMap file
(fontbox/target/classes/org/apache/fontbox/cmap/Identity-H) it also has a
lot of blocks of begincidrange/endcidrange. It doesn't have any
beginbfchar/endbfchar block."

The file has a lot of begincidrange/endcidrange blocks.

In fact, it doesn't have any beginbfchar/endbfchar block (which conflicts
with item 9.10.3...).

About debugging the text extraction tool:

1) Identity resolution uses this coding pattern in PDFont.java to obtain
the Unicode value:

new String(new char[] { (char) code })

Something similar can be found at LegacyPDFStreamEngine.java
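
To show what this pattern does with the string from the original question, here is a standalone Java sketch. It is not the PDFBox code path; the class and method names are hypothetical, and it assumes two-byte codes written as hex digits.

```java
// Hypothetical helper for illustration only; not part of PDFBox.
final class IdentityDecodeSketch {

    // Treats every two-byte code in the hex string as a UTF-16 code unit,
    // using the same expression as quoted above.
    static String decode(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            int code = Integer.parseInt(hex.substring(i, i + 4), 16);
            sb.append(new String(new char[] { (char) code }));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // <004200430044> is the operand of Tj in the PDF chunk of the first message
        System.out.println(decode("004200430044")); // prints BCD
    }
}
```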

My final thoughts:

1) Thank you Tilman for your help!

2) I think the tools that extract the "BCD" text could be partially
ignoring the CMap (because it is invalid for text extraction - it doesn't
contain beginbfchar/endbfchar or beginbfrange/endbfrange). So, maybe they
don't try the five steps (letters "a" to "e") from item 9.10.2. Maybe their
choice is the "identity" transformation for a failed Unicode production...

"If these methods fail to produce a Unicode value, there is no way to
determine what the character code represents in which case a PDF processor
may choose a character code of their choosing."

3) I don't have any suggestions for a code change that would be a good
solution. I may have to extract text from some thousands of PDFs like
"pag4_alt.pdf". In that case, I'll change the code with something like
the file "identityForBadToUnicodeCMap.patch" that I've dropped in the shared
folder.
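
The general idea of such a fallback can be sketched in plain Java as below. This is not the content of the patch file and it does not use the PDFBox API: the map stands for whatever bfchar/bfrange mappings were parsed from the ToUnicode stream, and all names are hypothetical. As Tilman points out in the quoted reply, an identity fallback gives wrong results for other files with an unusable /ToUnicode stream, so this is only suitable for a known set of similar PDFs.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of PDFBox.
final class ToUnicodeFallbackSketch {

    // Returns the Unicode string for a character code.
    // Assumption: an empty map means the ToUnicode CMap had no bf mappings at all.
    static String toUnicode(int code, Map<Integer, String> bfMappings) {
        if (bfMappings.isEmpty()) {
            // No usable mapping: fall back to identity (code == UTF-16 code unit)
            return String.valueOf((char) code);
        }
        // Usable CMap: trust it, even if this particular code is unmapped (null)
        return bfMappings.get(code);
    }
}
```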






On Fri, 15 Mar 2024 at 10:54, Tilman Hausherr <thaush...@t-online.de>
wrote:

> Yes, identity does work for that file. However, using that logic fails to
> provide the correct results for other files with an unusable /ToUnicode
> stream.
>
> Yes there can be larger blocks.
>
> My suspicion is that the tools that use "identity" for your file will
> fail for some of the files, unless we discover yet another tweak of that
> workaround algorithm that works with all.
>
> Tilman
>
> On 15.03.2024 14:28, Luiz Marcelo Modesto wrote:
> > Thank you Tilman!
> >
> > I'll try to read ISO 32000-2:2020 again to look for some kind of precedence
> > rules regarding the way of decoding string codes to Unicode chars.
> >
> > My impression is that there are some choices but I don't remember if there
> > is something assertive or not. Maybe it could be just an implementation
> > choice.
> >
> > I'll try to debug the extraction text tool to verify why using the
> > predefined Identity CMap works.
> >
> > If I've looked at the correct CMap file
> > (fontbox/target/classes/org/apache/fontbox/cmap/Identity-H) it also has a
> > lot of blocks of begincidrange/endcidrange. It doesn't have any
> > beginbfchar/endbfchar block.
> >
> > All the blocks have their length limited to 256 codes, but it seems PDFBox
> > can support larger blocks. But, maybe the set "<0100> <FFFF> 256" could be
> > a problem...
> >
> > PS.: The use of "true" was just a fast and dirty way to do a fast test, as
> > the beginbfchar/endbfchar block suggested to me an identity mapping.
> >
> >
> >
> >
> > On Fri, 15 Mar 2024 at 01:35, Tilman Hausherr <thaush...@t-online.de>
> > wrote:
> >
> >> You are correct that it's the "bf" parts that are missing. (And some of
> >> the other tools you tried also mention this.)
> >>
> >> Just adding true results in text extraction of several files no longer
> >> being correct: 433525-p1.pdf, O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf,
> >> PDFBOX-5540.pdf, R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf.
> >>
> >> Adding  "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" brings
> >> no regressions but your text is not extracted properly.
> >>
> >> Maybe it is possible to include yet another rule for your file, but
> >> there's likely more to do and there is the risk that other files no
> >> longer extract properly.
> >>
> >> Tilman
> >>
> >> On 15.03.2024 00:08, Luiz Marcelo Modesto wrote:
> >>> It seems that PDFBOX-5540 resolves a special case based on some dictionary
> >>> properties and chooses a predefined CMap (Identity CMap).
> >>>
> >>> Reading the PDFont.java code, I think the warning "Invalid ToUnicode CMap
> >>> in font AvenirNextLTPro-Cn" comes from the fact that the CMap stream
> >>> doesn't contain 1 or more blocks of beginbfchar/endbfchar.
> >>>
> >>> The two CMap's HashMaps (charToUnicodeOneByte and charToUnicodeTwoBytes)
> >>> are really empty.
> >>>
> >>> But the font CMap stream contains this block:
> >>>
> >>> 2 begincidrange
> >>> <0001> <00FF> 1
> >>> <0100> <FFFF> 256
> >>> endcidrange
> >>>
> >>> I'm sorry if I misunderstood, but this is a valid CMap too (it seems a kind
> >>> of Identity mapping too, except for the 0x00...), isn't it?
> >>>
> >>> It's only shorter than the one I could have if I write several blocks of
> >>> beginbfchar/endbfchar.
> >>>
> >>> If I make this "dumb" modification (adding "true" to conditions) just for a
> >>> rapid test
> >>>
> >>> if (cmapName.contains("Identity") //
> >>>         || ordering.contains("Identity") //
> >>>         || COSName.IDENTITY_H.equals(encoding) //
> >>>         || COSName.IDENTITY_V.equals(encoding) || true)
> >>> {
> >>>     COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
> >>>     if (true || encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
> >>>     {
> >>>         // assume that if encoding is identity, then the reverse is also true
> >>>         cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
> >>>         LOG.warn("Using predefined identity CMap instead");
> >>>     }
> >>> }
> >>>
> >>> I get the "BCD" string, like all the other tools.
> >>>
> >>> The encoding parameter is ignored when writing to the console.
> >>> mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont
> >>> loadUnicodeCmap
> >>> ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
> >>> mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont
> >>> loadUnicodeCmap
> >>> ADVERTÊNCIA: Using predefined identity CMap instead
> >>> Página 4 de 4
> >>> Informações:  BCD
> >>>
> >>> Maybe the text extraction tool should be using the begincidrange/endcidrange
> >>> information...
> >>>
> >>> What do you think?
> >>>
> >>> PS.: I've read some pieces from ISO 32000-2:2020 but it is quite long.
> >>> Maybe I'm missing something... I'm sorry if this is the case...
> >>>
> >>> On Thu, 14 Mar 2024 at 10:30, Luiz Marcelo Modesto <
> >>> lmodesto.w...@gmail.com> wrote:
> >>>
> >>>> Ok!
> >>>>
> >>>> I'll read PDFBOX-5540 and related issues.
> >>>>
> >>>> Thank you very much!
> >>>>
> >>>>
> >>>> On Thu, 14 Mar 2024 at 10:08, Tilman Hausherr <thaush...@t-online.de>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> The problem is in the ToUnicode stream, there's a log message "Invalid
> >>>>> ToUnicode CMap in font AvenirNextLTPro-Cn". It has no unicode mappings.
> >>>>> PDFBox is trying a fallback solution which turns out to be wrong. This
> >>>>> is related to PDFBOX-5540 and earlier related issues.
> >>>>>
> >>>>> Tilman
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:
> >>>>>> Hi Tilman!
> >>>>>>
> >>>>>>        Thank you very much for your attention!
> >>>>>>
> >>>>>>        You can find the file "p4_alt.pdf" in this folder
> >>>>>> <https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
> >>>>>> "Extra infos.pdf" file shows some output from PDF Debugger and others.
> >>>>>>
> >>>>>>        I'm sorry, I sent the pdf file as an attachment in my first message,
> >>>>>> but I didn't know that it wouldn't work.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, 14 Mar 2024 at 07:16, Tilman Hausherr <thaush...@t-online.de>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> please upload your file to a sharehoster.
> >>>>>>>
> >>>>>>> Tilman
> >>>>>>>
> >>>>>>> On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>>        I'm not sure if this is the same as FAQ "How come I am getting
> >>>>>>>> gibberish(G38G43G36G51G5) when extracting text?"...
> >>>>>>>>
> >>>>>>>>        I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
> >>>>>>>> (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
> >>>>>>>>
> >>>>>>>>        I'm trying to understand how this PDF chunk (from p4_fix.pdf
> >>>>>>>> attached)
> >>>>>>>>      BT
> >>>>>>>>      /G1F7 6.0 Tf
> >>>>>>>>      94.871 773.806 Td
> >>>>>>>>      <004200430044> Tj
> >>>>>>>>      ET
> >>>>>>>>
> >>>>>>>>        becomes "BCD" on PDFBox Debugger (the same on qpdfview, Adobe
> >>>>>>>> Reader, Chrome, ...) and becomes "abc" on PDFBox text extraction tool.
> >>>>>>>>        Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.
> >>>>>>>>        The renderers that allow me to copy the text give me "BCD" text.
> >>>>>>>>
> >>>>>>>>        It seems that PDFBox extraction tool follows the item "9.10.2
> >>>>>>>> Mapping character codes to Unicode values" (ISO 32000-2:2020) but all
> >>>>>>>> the others choose a different way.
> >>>>>>>>
> >>>>>>>>         Could you help me to understand if there is a problem with the
> >>>>>>>> PDF file, with the renderers or with the text extraction tool?
> >>>>>>>>
> >>>>>>>> Thank you!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ---------------------------------------------------------------------
> >>>>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
