John H. Jenkins wrote:

> The basic idea is that "plain text" is the minimum amount of
> information to process the given language in a "normal" way.

That's a bit vague. We don't normally "process" languages; we read texts. Whether font or color variation is essential for understanding really depends on the author's purposes and choices, not on the language.

> FOR EXAMPLE, ALTHOUGH ENGLISH CAN BE WRITTEN IN ALL-CAPS, IT USUALLY
> ISN'T, AND DOING IT LOOKS WRONG.

I wouldn't say it looks wrong. It is often typographically poor or just stupid, but it might be a consequence of technical limitations: there are still loads of systems that make no case distinction in texts, so in any relevant respect they are effectively "uppercase-only". And all-caps English is quite understandable, though boring to read, provided that writers take some precautions.

> We therefore have both upper- and
> lower-case letters for English.

It's just a distinction that you _can_ (and usually do) make in plain text English. It's not an inherent distinction: all-caps English is still English, though poorly written by modern standards.
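
To illustrate with a small Python sketch (the sample words are my own choice, not taken from the discussion): case is carried by the character codes themselves, and folding to all-caps discards a distinction that the writer then has to compensate for in some other way.

```python
# Case is encoded in the character codes: "a" and "A" are different characters.
lower_a, upper_a = ord("a"), ord("A")

# Hypothetical sample words: after folding to uppercase they can no longer
# be told apart, although the result is still readable English.
words = ["Polish", "polish"]
folded = [w.upper() for w in words]

print(lower_a, upper_a)
print(folded)
```

This is the kind of case where "precautions" are needed in uppercase-only text: the context has to make clear which word is meant.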

> Arabic, on the other hand, absolutely must have some way of allowing
> for different letter shapes in different contexts, or it looks just
> wrong, so Arabic "plain text" must have facility to allow for that,
> either by explicitly having different characters for the different
> shapes the letters take, or by providing a default layout algorithm
> that defines them.

But "layout algorithms" are not part of character encoding or of the definition of "plain text". Admittedly, it's not OK to render plain text Arabic, encoded at the logical level (i.e., letters encoded abstractly and not as contextual forms), in a simplistic manner that uses a one-letter, one-glyph model. That is a requirement on rendering, though, not part of the definition of "plain text".
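
As a rough sketch of the difference between the two encoding levels, using Python's standard `unicodedata` module: Unicode has the abstract letter BEH (U+0628) and, as compatibility characters, separate codes for its four contextual shapes. The choice of BEH as the sample letter is mine. Compatibility normalization (NFKC) maps each shape back to the abstract letter, which suggests that the shapes are treated as rendering variants rather than as distinct letters.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract (logical) letter

# Compatibility characters for the isolated, final, initial, and medial shapes.
presentation_forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

for ch in presentation_forms:
    # Show the code point, the character name, and the compatibility mapping.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.decomposition(ch))

# NFKC normalization folds every contextual shape back to the abstract letter.
normalized = [unicodedata.normalize("NFKC", ch) for ch in presentation_forms]
print(all(n == BEH for n in normalized))
```

Choosing the right shape for text encoded with U+0628 is then the renderer's job; nothing in the character data itself says which shape to use.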

> Yes, there are issues which end up being judgment calls, and it's
> easy to come up with cases where you can't really capture the full
> semantic intent of the author without what Unicode calls "rich text."

We don't need to invent contrived examples for that. Every time an author uses italics or bolding to make an essential point by emphasizing something, he does something that cannot be captured in a plain version of the text. To make an even simpler point, if you insert an essential content image into a document, you step outside the realm of plain text.

I don't see any better definition for "plain text" than a negative one: it is text without formatting, except to the extent that forced line breaks and the choice of alternative forms for a character (to the extent that such differences are encoded in the character code) can be considered as formatting. "Plain text", though apparently a very simple concept, is a very abstract one. I don't think you can explain the concept to your neighbor while standing on one foot, if at all.

Human writing did not originate as plain text, and at the surface level, it is never "plain text": it always has some specific physical appearance, and abstract "plain text" can only be found below the surface, as the underlying data format where only character identities (character numbers in a specific code) are encoded, with no reference to a particular rendering.
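
A minimal Python sketch of that underlying level, with a sample string of my own choosing: the text reduces to a sequence of character numbers, different encodings serialize those numbers as different bytes, and nothing in the data refers to font, size, or color.

```python
text = "plain"

# The character identities: one number per character in the Unicode code.
code_points = [ord(ch) for ch in text]

# Two different byte serializations of the same character sequence.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")

# The numbers alone are enough to get the text back.
restored = "".join(chr(n) for n in code_points)

print(code_points)
print(utf8.hex(), utf16.hex())
print(restored == text)
```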

--
Yucca, http://www.cs.tut.fi/~jkorpela/
