In the discussion leading up to this it has been implied that Unicode encodes / should encode concepts or pure shape. And there's been some confusion as to where concerns about sorting or legacy encodings fit in. Time to step back a bit:

Primarily the Unicode Standard encodes by character identity - something that is different from either the pure shape or the "concept denoted by the character".

For example, for most alphabetic characters, you could say that they stand for a more-or-less well-defined phonetic value. But Unicode does not encode such values directly; instead it encodes letters - which in turn get re-purposed for different sound values in each writing system.

Likewise, the various uses of period or comma are not separately encoded - potentially, these marks could be given mappings to specific functions for each writing system or notation that uses them.

Clearly these are not encoded to represent a single mapping to an external concept, and, as we will see, they are not necessarily encoded directly by shape.

Instead, the Unicode Standard encodes character identity; but there are a number of principled and some ad-hoc deviations from a purist implementation of that approach.

The first one is that of forcing a disunification by script. What constitutes a script can be argued over, especially as they all seem to have evolved from (or been created based on) predecessor scripts, so there are always pairs of scripts that have a lot in common. While an "Alpha" and an "A" do have much in common, it is best to recognize that their membership in different scripts leads to important differences so that it's not a stretch to say that they no longer share the same identity.
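
To make that concrete, here is a quick illustration (a sketch in Python, using the standard unicodedata module) of the fact that Latin A, Greek Alpha and Cyrillic A are three distinct characters despite their near-identical reference glyphs:

    import unicodedata

    # Three visually similar capital letters, one per script,
    # each encoded as its own character with its own name.
    for ch in ("\u0041", "\u0391", "\u0410"):
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

    # U+0041  LATIN CAPITAL LETTER A
    # U+0391  GREEK CAPITAL LETTER ALPHA
    # U+0410  CYRILLIC CAPITAL LETTER A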

The next principled deviation is that of requiring case pairs to be unique. Bicameral scripts (and some of the characters in them) acquired their lowercase forms at different times, so that the relation between uppercase and lowercase differs across scripts and gives rise to some exceptional cases inside certain scripts.

This is one of the reasons to disunify certain bicameral scripts. But even inside scripts, there are case pairs that may share lowercase forms or uppercase forms; those shared forms are disunified to keep the pairs separate. The first two principles match user expectations in that case changes (largely) work as expected in plain text and sorting also (largely) matches user expectation by default.
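
A small sketch of both points (again Python's unicodedata; the characters are merely convenient examples): case mappings stay within a script, and Greek sigma is one of the exceptional cases, with two lowercase forms sharing a single uppercase partner:

    import unicodedata

    # Case pairs stay within their script: Latin A pairs with Latin a,
    # Greek Alpha with Greek alpha.
    for upper in ("\u0041", "\u0391"):
        print(unicodedata.name(upper), "->", unicodedata.name(upper.lower()))
    # LATIN CAPITAL LETTER A -> LATIN SMALL LETTER A
    # GREEK CAPITAL LETTER ALPHA -> GREEK SMALL LETTER ALPHA

    # Exceptional case inside Greek: medial and final sigma are
    # separate lowercase characters sharing one uppercase form.
    print("\u03C3".upper() == "\u03C2".upper())   # True (both give U+03A3)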

The third principle is to disunify characters based on line-breaking or line-layout properties. Implicit in that is the idea that plain text, and not markup, is the place to influence basic algorithms such as line-breaking and bidi layout (hence two sets of Arabic-Indic digits). One can argue with that decision, but the fact is, there are too many places where text exists without the ability to apply markup to go entirely without that support.
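
The digits make a compact example: the two sets look alike but carry different bidirectional classes, which is exactly the kind of layout behavior that has to ride along in plain text. A sketch, assuming Python's unicodedata module:

    import unicodedata

    # Two sets of Arabic-Indic digits, disunified largely for bidi behavior.
    for ch in ("\u0661", "\u06F1"):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch),
              "- bidi class:", unicodedata.bidirectional(ch))

    # U+0661 ARABIC-INDIC DIGIT ONE - bidi class: AN
    # U+06F1 EXTENDED ARABIC-INDIC DIGIT ONE - bidi class: EN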

The fourth principle is that of differential variability of appearance. For letters proper, their identity can be associated with a wide range of appearances from sparse to fanciful glyphs. If an entire piece of text (or even a given word) is set using a particular font style, context will enable the reader to identify the underlying letter, even if the shape is almost unrelated to the "archetypical shape" documented in the Standard.

When letters or marks get re-used in notational systems, though, the permissible range of variability changes dramatically - variations that do not change the meaning of a word in styled text suddenly change the meaning of text in a certain notational system. Hence the disunification of certain letters or marks (but not all of them) in support of mathematical notation.
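
The Mathematical Alphanumeric Symbols block is an example of this: a mathematical italic capital A is a character in its own right, disunified from the Latin A, though its compatibility decomposition still records the relationship. A brief sketch in Python:

    import unicodedata

    math_italic_a = "\U0001D434"
    print(unicodedata.name(math_italic_a))   # MATHEMATICAL ITALIC CAPITAL A

    # Disunified from the plain letter ...
    print(math_italic_a == "A")                           # False
    # ... but compatibility normalization folds it back to the base letter,
    # recording the relationship without merging the identities.
    print(unicodedata.normalize("NFKC", math_italic_a))   # A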

The fifth principle appears to be to disunify only as far as, and only when, necessary. The biggest downside of this principle is that it leads to "late" disunifications; some characters get disunified only as the committee becomes aware of some issue, leading to the problem of legacy data. But it has usefully, if only somewhat, limited the further proliferation of characters of identical appearance.

The final principle is compatibility. This covers being able to round-trip from certain legacy encodings. This principle may force some disunifications that otherwise might not have happened, but it also isn't a panacea: there are legacy encodings that are mutually incompatible, so one needs to choose which one to support. TeX, being a "glyph-based" system, loses out here in comparison to legacy plain-text character encoding systems such as the 8859 series of ISO/IEC standards.
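
For a concrete sense of what round-tripping means, a minimal check (Python, with ISO 8859-1 picked as the sample legacy encoding): every byte decodes to a Unicode character and encodes back to the same byte.

    # Legacy bytes -> Unicode -> the same legacy bytes, losslessly.
    legacy = bytes(range(256))
    roundtripped = legacy.decode("iso8859-1").encode("iso8859-1")
    assert roundtripped == legacy
    print("ISO 8859-1 round-trips cleanly through Unicode")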

Some unifications among punctuation marks in particular seem to have been made on a more ad-hoc basis. The issue is exacerbated by the fact that many such systems have neither the wide familiarity of standard writing systems (with their tolerance for glyph variation) nor the rigor of something like mathematical notation. This leads to the pragmatic choice of letting users select either "shape" or "concept" rather than "identity"; such ad-hoc solutions should generally be resisted -- they are certainly not to be seen as a precedent for "encoding concepts" at large.

But such exceptions prove the rule, which leads back to where we started: the default position is that Unicode encodes a character identity that is not the same as encoding the concept that said character is used to represent in writing.

A./


