> On 20 Jul 2016, at 14:33, Ygor Mutti <[email protected]> wrote:
>
> IMHO, the responsibilities are messed up in this case.
>
> I'm surprised to find out that Unicode deals with typographic sugar like
> ligatures. This could be much more conveniently handled by the font using
> separate glyphs.

Yes, indeed. There's only a handful of ligatures in Unicode, kept for
backwards compatibility with legacy systems.

> Also, I think only text search algorithms, not PDF authoring tools, should
> be concerned with searches using approximations. We already have to deal with
> PDF authors that don't approximate uncommon glyphs, so we have to handle
> them during text search anyway.

I think you might be after the "compatibility decomposition" defined by Unicode.

> I've solved the problem by determining the width of each character in the
> Unicode string as the width of the ligature divided by the length of the
> string. This is adequate for our purposes.

That's the approach used for placing a caret inside a ligature, so it's a
decent choice.

-- John

> Thank you, Tilman and John, for the help!
>
>> On Tue, Jul 19, 2016 at 6:48 PM John Hewson <[email protected]> wrote:
>>
>>> On 19 Jul 2016, at 14:28, Tilman Hausherr <[email protected]> wrote:
>>>
>>> On 19.07.2016 at 23:09, Ygor Mutti wrote:
>>>> Yes, it helps. Thank you for the prompt answer!
>>>>
>>>> I wonder why the string returned by getUnicode contains the separate chars
>>>> instead of the ligature. Is there some way I can configure PDFTextStripper
>>>> to decode it as it is in the PDF?
>>>
>>> No, I don't know.
>>>
>>> The reason that it is decoded the way it is is the CMap table, which looks
>>> like this and tells what to do with the codes in the PDF
>>
>> You mean the ToUnicode CMap (that’s what’s below). The CMap is found in
>> the Encoding entry and maps a character code to a CID.
>>
>>> /CIDInit /ProcSet findresource begin
>>> 12 dict begin
>>> begincmap
>>> /CIDSystemInfo
>>> << /Registry (Adobe)
>>> /Ordering (UCS) /Supplement 0 >> def
>>> /CMapName /Adobe-Identity-UCS def
>>> /CMapType 2 def
>>> 1 begincodespacerange
>>> <0000> <FFFF>
>>> endcodespacerange
>>> 100 beginbfchar
>>> <1D> <0066006C> <============ fl
>>> <1E> <2212>
>>> <1F> <00660069> <=========== fi
>>> (...)
>>>
>>> 1F = octal 037 decodes to 00660069, i.e. two Unicode characters, f and i.
>>>
>>> Think about it... if it decoded to the "fi" Unicode character, you
>>> wouldn't be able to text-search for "Justificação" easily in an extracted
>>> text.
>>
>> Indeed. The ToUnicode CMap in this PDF specifies that the “fi” glyph
>> represents “f” and “i” in Unicode.
>>
>> -- John
>>
>>> Tilman
>>>
>>>> On Tue, Jul 19, 2016 at 4:47 PM Tilman Hausherr <[email protected]>
>>>> wrote:
>>>>
>>>>>> On 19.07.2016 at 20:43, Ygor Mutti wrote:
>>>>>> Hi!
>>>>>>
>>>>>> The javadoc states that the TextPosition.getIndividualWidths() method
>>>>>> returns "An array that is the same length as the length of the string."
>>>>>> Here is a gist containing a test case in which this statement is false:
>>>>>> https://gist.github.com/ygormutti/d40a80d425d552151625a063fb29c9ca
>>>>> I'd say the javadoc is wrong. It is the length of the CharacterCodes
>>>>> array, not the length of the Unicode string. The "fi" in Justificação is
>>>>> one glyph, a ligature.
>>>>>
>>>>> This is the content stream:
>>>>>
>>>>> [ (J) 20 (usti\037ca\347\343o) ] TJ
>>>>>
>>>>> Does this explanation help?
>>>>>
>>>>> Tilman
>>>>>
>>>>>> It prints a line for two cases where TextPosition.getUnicode() returns
>>>>>> "fi" while at the same time TextPosition.getIndividualWidths() returns an
>>>>>> array containing a single float.
>>>>>>
>>>>>> I've tried to pin down the version in which this behavior was introduced
>>>>>> and found that it works as expected in the 1.2.1 release and has not
>>>>>> since 1.3.0.
>>>>>>
>>>>>> Should I open a ticket for this?
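
For anyone who hits the same mismatch, here is a minimal sketch of the two ideas discussed in this thread: one character code (octal 037, hex 1F) maps through the ToUnicode entry <00660069> to two Unicode characters, and the single glyph width is then spread evenly over them. This is not PDFBox code. The class name LigatureWidths, both helper methods and the glyph width of 10.0 are made up for illustration; a real program would take the width from TextPosition.getIndividualWidths() and the text from TextPosition.getUnicode().

```java
import java.util.Arrays;

public class LigatureWidths {

    /**
     * Hypothetical helper: decodes a ToUnicode destination string such as
     * "00660069" by reading it as a sequence of 16-bit UTF-16BE code units.
     */
    public static String decodeToUnicodeHex(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    /**
     * Hypothetical helper: spreads one glyph width evenly over the UTF-16
     * chars the glyph decodes to (the workaround described in the thread).
     */
    public static float[] splitWidth(float glyphWidth, String unicode) {
        float[] widths = new float[unicode.length()];
        if (widths.length > 0) {
            Arrays.fill(widths, glyphWidth / widths.length);
        }
        return widths;
    }

    public static void main(String[] args) {
        int code = 037; // Java octal literal, the same value as 0x1F
        String unicode = decodeToUnicodeHex("00660069"); // entry <1F> quoted above
        float[] widths = splitWidth(10.0f, unicode);     // 10.0f is an assumed width
        // Prints: 1f -> fi [5.0, 5.0]
        System.out.println(Integer.toHexString(code) + " -> " + unicode
                + " " + Arrays.toString(widths));
    }
}
```

The even split is only an approximation, since the "f" and "i" parts of a ligature are rarely the same width, but as noted above it is the same trick used for placing a caret inside a ligature.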
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]

