2012/10/15 Mojca Miklavec <[email protected]>:
> On Mon, Oct 15, 2012 at 12:04 AM, Andrew Cunningham wrote:
>>
>> This is the nature of the PDF format. It is a preprint format that focuses on
>> glyphs rather than characters.
>>
>> It partly depends on the font, and the OT features being used.
>>
>> In theory you can have ActualText in the PDF, but once you move to complex
>> scripts all bets are off. Without a complete rewrite of the PDF standard
>> .... fidelity to the text is not really possible. PDF format wasn't designed
>> to do it.
>
> I might be wrong, but pdfTeX-generated documents work fine (after
> adding an encoding vector) even though the glyphs populate "random" slots
> in the font (for example T1 encoding) that have nothing to do with
> Unicode.
>

It works with good fonts in good viewers because these "good fonts" assign proper names to the glyphs, and the viewer can map those names back to Unicode characters. I tested this many years ago, not only with pdfTeX but also with tex + dvips followed by either ps2pdf from Ghostscript or Adobe Distiller.
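
A rough illustration of the glyph-name route: the Python sketch below shows how a viewer might turn glyph names into Unicode text, following the naming convention published with the Adobe Glyph List (drop any suffix after the first period, split ligature names on underscores, accept "uniXXXX" and "uXXXX" forms). It is only a sketch under assumptions: AGL_SUBSET is a tiny hand-picked excerpt, the function names are hypothetical, and no particular viewer is claimed to work exactly this way.

```python
import re

# Hypothetical excerpt of glyph-name -> Unicode pairs in the style of the
# Adobe Glyph List. A real viewer would load the complete list.
AGL_SUBSET = {
    "A": "\u0041",
    "f": "\u0066",
    "i": "\u0069",
    "eacute": "\u00E9",
    "zcaron": "\u017E",
}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")  # uniXXXX, possibly repeated groups
_U = re.compile(r"u([0-9A-F]{4,6})")         # uXXXX .. uXXXXXX


def _is_scalar(cp):
    """True for valid Unicode scalar values (no surrogates, within range)."""
    return cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def component_to_text(component):
    """Map one component of a glyph name to text, or '' if it is unknown."""
    if component in AGL_SUBSET:
        return AGL_SUBSET[component]
    m = _UNI.fullmatch(component)
    if m:
        digits = m.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        if all(_is_scalar(cp) for cp in cps):
            return "".join(chr(cp) for cp in cps)
        return ""
    m = _U.fullmatch(component)
    if m:
        cp = int(m.group(1), 16)
        return chr(cp) if _is_scalar(cp) else ""
    return ""


def glyph_name_to_text(name):
    """Recover text from a glyph name such as 'zcaron', 'f_i' or 'A.swash'."""
    base = name.split(".", 1)[0]  # a suffix like '.swash' marks a variant
    return "".join(component_to_text(part) for part in base.split("_"))


if __name__ == "__main__":
    for name in ["zcaron", "uni017E", "f_i", "A.swash", "g123"]:
        print(name, "->", repr(glyph_name_to_text(name)))
```

The slot a glyph occupies in the encoding never enters the lookup, which is why T1-encoded fonts can still copy and paste correctly. A font whose glyphs carry arbitrary names (like "g123" here) yields nothing, which is the "bad font" case.
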
> It should be possible to do something similar in XeTeX/LuaTeX.
>
> I'm not saying that this would solve problems of copy-pasting Arabic
> scripts, but it should be possible to cover alternate glyphs for Latin
> scripts at least.
>
> Mojca
>
> PS: From
> http://blogs.adobe.com/insidepdf/2008/07/text_content_in_pdf_files.html
>
> There is an optional auxiliary structure called the "ToUnicode" table
> that was introduced into PDF to help with this text retrieval problem.
> A ToUnicode table can be associated with a font that does not normally
> have a way to determine the relationship between glyphs and Unicode
> characters (some do). The table maps strings of glyph identifiers into
> strings of Unicode characters, often just one to one, so that the
> proper character strings can be made from the glyph references in the
> file.
>

ToUnicode can only replace a single character code (one byte in a Type 1 font) with a sequence of Unicode characters. A Type 1 font can encode only 256 characters at a time, so such a mapping is possible. Many years ago I developed a ToUnicode map for Velthuis Devanagari:
http://icebearsoft.euweb.cz/dvngpdf/
Complex scripts would require a many-to-many mapping, but that is impossible with ToUnicode.

-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz

--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
http://tug.org/mailman/listinfo/xetex
