Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

Jukka K. Korpela Mon, 09 Aug 2010 11:32:57 -0700

John H. Jenkins wrote:

The basic idea is that "plain text" is the minimum amount of
information to process the given language in a "normal" way.

That's a bit vague. We don't normally "process" languages; we read texts. Whether font or color variation is essential for understanding really depends on the author's purposes and choices, not on language.

FOR
EXAMPLE, ALTHOUGH ENGLISH CAN BE WRITTEN IN ALL-CAPS, IT USUALLY
ISN'T, AND DOING IT LOOKS WRONG.

I wouldn't say it looks wrong. Surely it is often typographically poor or just stupid, but it might be a consequence of technical limitations (there are still loads of systems that make no case distinction in texts, so in any relevant aspect, they are effectively "uppercase-only"), and all-caps English is quite understandable, though boring to read, provided that some precautions are taken by writers.

We therefore have both upper- and
lower-case letters for English.

It's just a distinction that you _can_ (and usually do) make in plain text English. It's not an inherent distinction: all-caps English is still English, though poorly written by modern standards.

Arabic, on the other hand, absolutely must have some way of allowing
for different letter shapes in different contexts, or it looks just
wrong, so Arabic "plain text" must have facility to allow for that,
either by explicitly having different characters for the different
shapes the letters take, or by providing a default layout algorithm
that defines them.

But "layout algorithms" are not part of character encoding or part of the definition of "plain text". It's not OK to render plain text Arabic, encoded at logical level (i.e., letters encoded abstractly and not as contextual forms), in a simplistic manner that uses a one letter - one glyph model. But that's not part of the definition of "plain text" at all.
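To make the logical-level versus contextual-form distinction concrete, here is a small Python sketch. It relies on the fact that Unicode also encodes the four contextual shapes of ARABIC LETTER BEH as compatibility characters in the Arabic Presentation Forms-B block, and that compatibility normalization (NFKC) maps each of them back to the one abstract letter. The helper name `to_logical` is made up for this illustration; it is not taken from any standard or library.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH: the abstract letter, as encoded at logical level

# Compatibility characters for the isolated, final, initial and medial shapes of BEH.
PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def to_logical(text: str) -> str:
    """Fold compatibility characters (including Arabic presentation forms)
    back to their abstract letters via NFKC. Hypothetical helper name."""
    return unicodedata.normalize("NFKC", text)


for form in PRESENTATION_FORMS:
    logical = to_logical(form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(logical):04X}")
```

All four presentation forms fold to the same code point, U+0628. Choosing which shape to draw for a logically encoded BEH is left to the rendering software; nothing in the character data itself records it.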

Yes, there are issues which end up being judgment calls, and it's
easy to come up with cases where you can't really capture the full
semantic intent of the author without what Unicode calls "rich text."

We don't need to invent contrived examples for that. Every time an author uses italics or bolding to make an essential point by emphasizing something, he does something that cannot be captured in a plain version of the text. To make an even simpler point, if you insert an essential content image into a document, you step outside the realm of plain text.

I don't see any better definition for "plain text" than a negative one: it is text without formatting, except to the extent that forced line breaks and the choice of alternative forms for a character (to the extent that such differences are encoded in the character code) can be considered as formatting. "Plain text", though apparently a very simple concept, is a very abstract one. I don't think you can explain the concept to your neighbor while standing on one foot, if at all.

Human writing did not originate as plain text, and at the surface level, it is never "plain text": it always has some specific physical appearance, and abstract "plain text" can only be found below the surface, as the underlying data format where only character identities (character numbers in a specific code) are encoded, with no reference to a particular rendering.
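As a rough illustration of "character numbers in a specific code", the following Python sketch lists the Unicode code points of a string. The function name `code_points` is invented for this example. Note what does and does not show up: the case distinction and a forced line break are visible as distinct numbers, but nothing identifies a font, a color, or italics.

```python
def code_points(text: str) -> list[str]:
    """Return the Unicode code point of each character in U+XXXX notation.
    Hypothetical helper, written only for this illustration."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Upper- and lower-case letters are different characters with different numbers.
print(code_points("Aa"))

# A forced line break is itself an encoded character (LINE FEED, U+000A).
print(code_points("a\nb"))
```

Two renderings of the same string in different typefaces would produce exactly the same list, which is the sense in which the plain text lies below the surface appearance.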

--

Yucca, http://www.cs.tut.fi/~jkorpela/
