In the discussion leading up to this it has been implied that Unicode encodes / should encode concepts or pure shape. And there's been some confusion as to where concerns about sorting or legacy encodings fit in. Time to step back a bit:

Primarily the Unicode Standard encodes by character identity - something that is different from either the pure shape or the "concept denoted by the character".

For example, for most alphabetic characters, you could say that they stand for a more-or-less well-defined phonetic value. But Unicode does not encode such values directly; instead it encodes letters - which in turn get re-purposed for different sound values in each writing system.

Likewise, the various uses of period or comma are not separately encoded - potentially, these marks could be given mappings to specific functions for each writing system or notation that uses them.

Clearly these are not encoded to represent a single mapping to an external concept, and, as we will see, they are not necessarily encoded directly by shape.

Instead, the Unicode Standard encodes character identity; but there are a number of principled and some ad-hoc deviations from a purist implementation of that approach.

The first one is that of forcing a disunification by script. What constitutes a script can be argued over, especially as they all seem to have evolved from (or been created based on) predecessor scripts, so there are always pairs of scripts that have a lot in common. While an "Alpha" and an "A" do have much in common, it is best to recognize that their membership in different scripts leads to important differences so that it's not a stretch to say that they no longer share the same identity.
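
To make that concrete, here is a quick illustration (a sketch in Python, using the standard unicodedata module) of the fact that Latin A, Greek Alpha and Cyrillic A are three distinct characters despite their near-identical reference glyphs:

    import unicodedata

    # Three visually similar capital letters, one per script,
    # each encoded as its own character with its own name.
    for ch in ("\u0041", "\u0391", "\u0410"):
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

    # U+0041  LATIN CAPITAL LETTER A
    # U+0391  GREEK CAPITAL LETTER ALPHA
    # U+0410  CYRILLIC CAPITAL LETTER A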

The next principled deviation is that of requiring case pairs to be unique. Bicameral scripts (and some of the characters in them) acquired their lowercase forms at different times, so that the relation between uppercase and lowercase differs across scripts and gives rise to some exceptional cases inside certain scripts.

This is one of the reasons to disunify certain bicameral scripts. But even inside scripts, there are case pairs that may share lowercase forms or uppercase forms; those shared forms are disunified to keep the pairs separate. The first two principles match user expectations in that case changes (largely) work as expected in plain text and sorting also (largely) matches user expectation by default.
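
A small sketch of both points (again Python's unicodedata; the characters are merely convenient examples): case mappings stay within a script, and Greek sigma is one of the exceptional cases, with two lowercase forms sharing a single uppercase partner:

    import unicodedata

    # Case pairs stay within their script: Latin A pairs with Latin a,
    # Greek Alpha with Greek alpha.
    for upper in ("\u0041", "\u0391"):
        print(unicodedata.name(upper), "->", unicodedata.name(upper.lower()))
    # LATIN CAPITAL LETTER A -> LATIN SMALL LETTER A
    # GREEK CAPITAL LETTER ALPHA -> GREEK SMALL LETTER ALPHA

    # Exceptional case inside Greek: medial and final sigma are
    # separate lowercase characters sharing one uppercase form.
    print("\u03C3".upper() == "\u03C2".upper())   # True (both give U+03A3)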

The third principle is to disunify characters based on line-breaking or line-layout properties. Implicit in that is the idea that plain text, and not markup, is the place to influence basic algorithms such as line-breaking and bidi layout (hence two sets of Arabic-Indic digits). One can argue with that decision, but the fact is, there are too many places where text exists without the ability to apply markup to go entirely without that support.
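
The digits make a compact example: the two sets look alike but carry different bidirectional classes, which is exactly the kind of layout behavior that has to ride along in plain text. A sketch, assuming Python's unicodedata module:

    import unicodedata

    # Two sets of Arabic-Indic digits, disunified largely for bidi behavior.
    for ch in ("\u0661", "\u06F1"):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch),
              "- bidi class:", unicodedata.bidirectional(ch))

    # U+0661 ARABIC-INDIC DIGIT ONE - bidi class: AN
    # U+06F1 EXTENDED ARABIC-INDIC DIGIT ONE - bidi class: EN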

The fourth principle is that of differential variability of appearance. For letters proper, their identity can be associated with a wide range of appearances from sparse to fanciful glyphs. If an entire piece of text (or even a given word) is set using a particular font style, context will enable the reader to identify the underlying letter, even if the shape is almost unrelated to the "archetypical shape" documented in the Standard.

When letters or marks get re-used in notational systems, though, the permissible range of variability changes dramatically - variations that do not change the meaning of a word in styled text suddenly change the meaning of text in a certain notational system. Hence the disunification of certain letters or marks (but not all of them) in support of mathematical notation.
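
The Mathematical Alphanumeric Symbols block is an example of this: a mathematical italic capital A is a character in its own right, disunified from the Latin A, though its compatibility decomposition still records the relationship. A brief sketch in Python:

    import unicodedata

    math_italic_a = "\U0001D434"
    print(unicodedata.name(math_italic_a))   # MATHEMATICAL ITALIC CAPITAL A

    # Disunified from the plain letter ...
    print(math_italic_a == "A")                           # False
    # ... but compatibility normalization folds it back to the base letter,
    # recording the relationship without merging the identities.
    print(unicodedata.normalize("NFKC", math_italic_a))   # A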

The fifth principle appears to be to disunify only as far as, and only when, necessary. The biggest downside of this principle is that it leads to "late" disunifications; some characters get disunified only as the committee becomes aware of some issue, leading to the problem of legacy data. But it has usefully, if only somewhat, limited the further proliferation of characters of identical appearance.

The final principle is compatibility. This covers being able to round-trip from certain legacy encodings. This principle may force some disunifications that otherwise might not have happened, but it also isn't a panacea: there are legacy encodings that are mutually incompatible, so one needs to choose which one to support. TeX, being a "glyph-based" system, loses out here in comparison to legacy plain-text character encoding systems such as the 8859 series of ISO/IEC standards.
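
For a concrete sense of what round-tripping means, a minimal check (Python, with ISO 8859-1 picked as the sample legacy encoding): every byte decodes to a Unicode character and encodes back to the same byte.

    # Legacy bytes -> Unicode -> the same legacy bytes, losslessly.
    legacy = bytes(range(256))
    roundtripped = legacy.decode("iso8859-1").encode("iso8859-1")
    assert roundtripped == legacy
    print("ISO 8859-1 round-trips cleanly through Unicode")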

Some unifications among punctuation marks in particular seem to have been made on a more ad-hoc basis. The issue is exacerbated by the fact that many such systems have neither the wide familiarity of standard writing systems (with their tolerance for glyph variation) nor the rigor of something like mathematical notation. This leads to the pragmatic choice of letting users select either "shape" or "concept" rather than "identity"; such ad-hoc solutions should generally be resisted -- they are certainly not to be seen as a precedent for "encoding concepts" at large.

But such exceptions prove the rule, which leads back to where we started: the default position is that Unicode encodes a character identity that is not the same as encoding the concept that said character is used to represent in writing.

A./


