2011/11/18 maxwell <[email protected]>:
> On Fri, 18 Nov 2011 13:52:56 +0100, Zdenek Wagner
> <[email protected]>
> wrote:
>> 2011/11/18 Philip TAYLOR <[email protected]>:
>>> Is it safe to assume that these "code listings"
>>> are restricted to the ASCII character set ? If
>>> so, yes, spaces are likely to be a problem, but
>>> if the code listing can also include ligature-
>>> digraphs, then these are likely to prove even
>>> more problematic.
>>>
>> If the code listing is typeset in a fixed width font, it is usually no
>> problem. I copied a few code samples from books in PDF, most of them
>> were typeset by TeX. If I want to copy text in Devanagari, it is
>> almost impossible.
>
> Besides TeX, Dr. Knuth also invented Literate Programming. In our own
> project, we use LP to extract the code listings from the original source
> code, rather than from the PDF. One advantage is that in addition to the
> re-ordering at the character level (mentioned in part of Zdenek's email
> that I didn't copy over), this allows re-ordering at any arbitrary level,

This is a demonstration that glyphs are not the same as characters. I will start with a simpler case and will not put Devanagari into the mail message. If you wish to write the syllable RU, you have to add the dependent vowel (matra) U to the consonant RA. There is a ligature RU, so in the PDF you will not see a RA consonant with a U matra but a single RU glyph. Similarly, TRA is a single glyph representing the characters TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many mappings, so it is possible to handle these cases when copying text from a PDF or when searching.

A more difficult case is the I-matra (short dependent vowel I). As a character it must always follow a consonant (this is a general rule for all dependent vowels), but visually (as a glyph) it precedes the consonant group after which it is pronounced. The sample word was kitab (it means "book"). In Unicode (as characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually the I-matra precedes KA. XeTeX (knowing that it works with the Devanagari script) runs the character sequence through ICU, and the result is the glyph sequence. The original sequence is lost, so when the text is copied from the PDF, we get (not exactly) i*katab.

Microsoft suggested which additional characters should appear in Indic OpenType fonts. One of them is a dotted circle, which denotes a missing consonant. The I-matra must always follow a consonant (in character order). If it is moved to the beginning of a word, it is wrong. If you paste it into a text editor, the OpenType rendering engine should display the missing consonant as a dotted circle (if it is present in the font). In character order the dotted circle will precede the I-matra, but in visual (glyph) order it will be just the opposite. Thus the asterisk shows the place where you will see the dotted circle. This is just one simple case.

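The kitab case can be reproduced with a small Python sketch. It is only a toy model, not what ICU or any PDF viewer actually does: the function names (logical_to_visual, mark_orphans) are made up for illustration, only the pre-base I-matra is reordered, and ligatures and the RA+VIRAMA (reph) reordering are ignored. The code points are written as escapes so that the listing stays ASCII-only.

```python
import unicodedata

# Devanagari code points, written as escapes to keep the listing ASCII-only.
KA, CA, TA, PA, BA = "\u0915", "\u091A", "\u0924", "\u092A", "\u092C"
YA, RA, SA = "\u092F", "\u0930", "\u0938"
AA_MATRA, I_MATRA, O_MATRA = "\u093E", "\u093F", "\u094B"
VIRAMA = "\u094D"
DOTTED_CIRCLE = "\u25CC"

# The sample words in logical (character) order.
KITAB = KA + I_MATRA + TA + AA_MATRA + BA
PRIY = PA + VIRAMA + RA + I_MATRA + YA
STRIYOCIT = (SA + VIRAMA + TA + VIRAMA + RA + I_MATRA
             + YA + O_MATRA + CA + I_MATRA + TA)


def is_consonant(ch):
    """True for the basic Devanagari consonants KA..HA (U+0915..U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def logical_to_visual(text):
    """Toy shaping step: move each I-matra in front of the consonant
    group (consonant, then any number of VIRAMA+consonant) it follows."""
    out = []
    for ch in text:
        if ch != I_MATRA:
            out.append(ch)
            continue
        i = len(out)
        if i > 0 and is_consonant(out[i - 1]):
            i -= 1
            # Step back over earlier members of the conjunct.
            while i >= 2 and out[i - 1] == VIRAMA and is_consonant(out[i - 2]):
                i -= 2
        out.insert(i, ch)
    return "".join(out)


def mark_orphans(text):
    """Imitate a renderer: put a dotted circle before every I-matra
    that does not directly follow a consonant in character order."""
    out = []
    for ch in text:
        if ch == I_MATRA and not (out and is_consonant(out[-1])):
            out.append(DOTTED_CIRCLE)
        out.append(ch)
    return "".join(out)


def names(text):
    """Unicode character names, so the result can be read without a font."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    copied = logical_to_visual(KITAB)   # what a naive copy from PDF yields
    print(names(KITAB))
    print(names(copied))
    print(names(mark_orphans(copied)))
```

Treating the output of logical_to_visual as if it were character data is exactly the mistake made when copying from the PDF: the I-matra ends up at the start of the word, and mark_orphans shows where a rendering engine would have to supply the dotted circle.
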
The I-matra may follow a consonant group, as in the word PRIY (dear), which is PA+VIRAMA+RA+I-matra+YA, or STRIYOCIT (good for women), which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both words will start with the I-matra glyph. The latter will contain two ordering bugs after copy&paste.

Consider also the word MURTI (statue), which is the character sequence MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will appear as an accent below the MA glyph. The next glyph will be the I-matra, followed by TA, followed by RA shown as an upper accent at the right edge of the syllable. Generally, in RA+VIRAMA+consonant+matra the RA glyph appears at the end of the syllable although logically (in character order) it belongs to the beginning.

These cases cannot be solved by the toUnicode map because many-to-many mappings are not allowed. Moreover, a huge number of mappings would be needed. It would be better to do the reverse processing independently of the toUnicode mappings: to use ICU or Pango or Uniscribe or whatever to analyze the glyphs and convert them to characters. The rules are unambiguous, but AR does not do it.

We discuss nonbreakable spaces while we are not yet able to properly convert printable glyphs to characters when doing copy&paste from PDF...

-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz

--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex
