On 19.07.2016 at 23:09, Ygor Mutti wrote:
Yes, it helps. Thank you for the prompt answer!

I wonder why the string returned by getUnicode contains the separate chars
instead of the ligature. Is there some way I can configure PDFTextStripper
to decode it as it is in the PDF?

No, not that I know of.

It is decoded that way because of the font's CMap table, which tells the reader how to map the character codes in the PDF to Unicode. It looks like this:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
100 beginbfchar
<1D> <0066006C>   <============ fl
<1E> <2212>
<1F> <00660069>    <=========== fi
(...)

Code 1F (hex) is octal 037, the \037 in the content stream. It decodes to 00660069, i.e. two Unicode characters, f and i.
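
To make the mapping concrete, here is a minimal sketch in plain Java that reads the three bfchar entries quoted above and looks up code 1F. It does not use PDFBox; the class name BfCharSketch and its parse method are made up for illustration, and the sketch assumes each destination value is UTF-16BE hex with four digits per code unit.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class BfCharSketch {
    // Matches one bfchar entry: <sourceCode> <destinationUtf16Hex>
    private static final Pattern ENTRY =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    /** Parses bfchar entries into a map from character code to Unicode string. */
    static Map<Integer, String> parse(String bfcharBody) {
        Map<Integer, String> map = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(bfcharBody);
        while (m.find()) {
            int code = Integer.parseInt(m.group(1), 16);
            String hex = m.group(2);
            StringBuilder sb = new StringBuilder();
            // Assumption: every 4 hex digits form one UTF-16 code unit.
            for (int i = 0; i + 4 <= hex.length(); i += 4) {
                sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
            }
            map.put(code, sb.toString());
        }
        return map;
    }

    public static void main(String[] args) {
        Map<Integer, String> toUnicode =
                parse("<1D> <0066006C>\n<1E> <2212>\n<1F> <00660069>");
        String fi = toUnicode.get(0x1F);
        // One character code in the PDF, two chars in the decoded string.
        System.out.println(fi + " has length " + fi.length()); // fi has length 2
    }
}
```

This is why one glyph can yield a two-character string: the single code 1F is mapped to two UTF-16 code units.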

Think about it: if it decoded to the single "fi" ligature character (U+FB01) instead, you wouldn't be able to text-search for "Justificação" easily in the extracted text.
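
The search problem can be shown with the standard library alone. The sketch below is only an illustration: the class name LigatureSearchSketch and the fold helper are made up, and the string with U+FB01 is a hypothetical extraction result, not something PDFBox returned here. It also shows that java.text.Normalizer with form NFKC folds the ligature back into plain "fi", which may be useful if you ever do end up with ligature characters in extracted text.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox.
class LigatureSearchSketch {
    /** Folds compatibility characters such as U+FB01 into their plain equivalents. */
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // What the CMap above produces: separate f and i.
        String decomposed = "Justifica\u00E7\u00E3o";
        // Hypothetical result if the ligature character were kept.
        String ligature = "Justi\uFB01ca\u00E7\u00E3o";

        System.out.println(decomposed.contains("fi"));      // true
        System.out.println(ligature.contains("fi"));        // false
        System.out.println(fold(ligature).contains("fi"));  // true
    }
}
```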

Tilman



On Tue, Jul 19, 2016 at 4:47 PM Tilman Hausherr wrote:

On 19.07.2016 at 20:43, Ygor Mutti wrote:
Hi!

The javadoc states that the TextPosition.getIndividualWidths() method
returns "An array that is the same length as the length of the string."
Here is a gist containing a test case in which this statement is false:
https://gist.github.com/ygormutti/d40a80d425d552151625a063fb29c9ca
I'd say the javadoc is wrong. The array has the length of the CharacterCodes
array, not the length of the Unicode string. The "fi" in Justificação is
one glyph, a ligature.

This is the content stream:

[ (J) 20 (usti\037ca\347\343o) ] TJ
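
As a rough illustration of the count mismatch, the following plain Java sketch decodes the two literal strings from the TJ array above into character codes and compares that count with the length of the extracted word. It is not how PDFBox parses content streams; the class LiteralStringSketch and its codes method are made up, and only \ddd octal escapes are handled (other escapes and nested parentheses are ignored).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class LiteralStringSketch {
    /** Decodes a PDF literal string body; only \ddd octal escapes are supported. */
    static List<Integer> codes(String body) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '\\') {
                // Read up to three octal digits after the backslash.
                int j = i + 1;
                int value = 0;
                while (j < body.length() && j < i + 4
                        && body.charAt(j) >= '0' && body.charAt(j) <= '7') {
                    value = value * 8 + (body.charAt(j) - '0');
                    j++;
                }
                out.add(value);
                i = j;
            } else {
                out.add((int) c);
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> all = new ArrayList<>(codes("J"));
        all.addAll(codes("usti\\037ca\\347\\343o"));
        System.out.println(all.size());                          // 11
        System.out.println("Justifica\u00E7\u00E3o".length());   // 12
    }
}
```

The eleven codes become twelve characters because \037 is the single ligature glyph that decodes to "fi".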

Does this explanation help?

Tilman

It prints a line for two cases where TextPosition.getUnicode() returns
"fi" while at the same time TextPosition.getIndividualWidths() returns an
array containing a single float.

I tried to pin down the version in which this behavior was introduced and
found that it works as expected in the 1.2.1 release and has not since
1.3.0.

Should I open a ticket for this?


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



