On Mon, Oct 15, 2012 at 12:04 AM, Andrew Cunningham wrote:
>
> This is the nature of the PDF format. It is a preprint format that focuses on
> glyphs rather than characters.
>
> It partly depends on the font, and the OT features being used.
>
> In theory you can have ActualText in the PDF, but once you move to complex
> scripts all bets are off. Without a complete rewrite of the PDF standard
> .... fidelity to the text is not really possible. PDF format wasn't designed
> to do it.

I might be wrong, but pdfTeX-generated documents work fine (after adding an encoding vector) even though the glyphs populate "random" slots in the font (for example in T1 encoding) that have nothing to do with Unicode. It should be possible to do something similar in XeTeX/LuaTeX. I'm not saying that this would solve the problems of copy-pasting Arabic script, but it should be possible to cover alternate glyphs for Latin scripts at least.

Mojca

PS: From http://blogs.adobe.com/insidepdf/2008/07/text_content_in_pdf_files.html

> There is an optional auxiliary structure called the "ToUnicode" table that
> was introduced into PDF to help with this text retrieval problem. A
> ToUnicode table can be associated with a font that does not normally have a
> way to determine the relationship between glyphs and Unicode characters
> (some do). The table maps strings of glyph identifiers into strings of
> Unicode characters, often just one to one, so that the proper character
> strings can be made from the glyph references in the file.

--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
http://tug.org/mailman/listinfo/xetex
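
To make the ToUnicode idea from the PS a bit more concrete, here is a rough Python sketch of how a text extractor might use such a table. It is only an illustration under stated assumptions: the glyph codes and the `BFCHAR_SECTION` text are invented, the helper names `parse_bfchar` and `glyphs_to_text` are hypothetical, and only the simple `bfchar` form of a ToUnicode CMap is handled (destinations written as UTF-16BE hex).

```python
import re

# Invented example of a bfchar section from a ToUnicode CMap.
# Left column: glyph code in the font (arbitrary slots, unrelated to Unicode).
# Right column: the Unicode string for that glyph, as UTF-16BE hex.
BFCHAR_SECTION = """
4 beginbfchar
<01> <006F>
<02> <006600660069>
<03> <0063>
<04> <0065>
endbfchar
"""

# Matches one "<source> <destination>" pair of hex strings.
PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(section):
    """Build a {glyph code: Unicode string} table from bfchar lines."""
    table = {}
    for src, dst in PAIR.findall(section):
        table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return table


def glyphs_to_text(glyph_codes, table):
    """Map a sequence of glyph codes back to text.

    Glyphs without an entry become U+FFFD, which is roughly what a reader
    sees when a font has no usable mapping.
    """
    return "".join(table.get(code, "\ufffd") for code in glyph_codes)


TABLE = parse_bfchar(BFCHAR_SECTION)
print(glyphs_to_text([1, 2, 3, 4], TABLE))
```

Glyph `<02>` stands for a single "ffi" ligature glyph that maps back to three characters, which is the "not always one to one" case the quoted text mentions. The same mechanism could in principle cover alternate Latin glyphs; it does not by itself address reordering or shaping in complex scripts.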
