On 11/5/2024 12:31 PM, Phil Smith III via Unicode wrote:

I assume you’ve seen https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts, which discusses what is and isn’t available as super/subscripts (henceforth “ss”) in Unicode. That surprised me—I would have thought that ss were markup, not characters, so there’s more of it implemented already than I’d expected.

The consensus that emerged over the first several decades of Unicode's development treats these forms somewhat ambiguously.

In mathematical notation, any character can be a super- or subscript, so you find multiple scripts and symbols used that way, with no limit in principle on what additional characters some specialty may adopt and super/subscript for some purpose. You also have things like subscripts on subscripts and similarly complex layouts. In that context it is definitely appropriate to treat subscripting as a generic operation and not to try to encode some subset of the possible results of that operation. You could never encode all the forms that are ever used (or available for use) in mathematical notation, so for that purpose, encoding any further explicit subscript forms doesn't help.
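
As a small illustration of my own (TeX-style markup, chosen only because it is familiar), nesting is routine, and only a generic operation can express it:

    % a subscript carrying its own subscript, plus a nested superscript
    $x_{i_1} + a_{n_k}^{2^m}$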

There is generic use of (mostly) superscript numbers in ordinary text, for things like footnote markers. These are also best handled as generic operations (via styles), particularly as they relate to document structure that already goes beyond plain text.

There are other notations, mainly phonetic, that have super/subscript forms but do not need recursive subscripting or all the other interesting features of mathematical layout and formatting. In many of them, the super- or subscript form acts pretty much like any other letter in the notation, except for its shape. Common to these notations is that there is a fixed set of such shapes; they don't even cover a full basic alphabet (that Unicode is getting close to having a full superscript alphabet is the result of overlapping uses).

For these cases there's a benefit in having a robust plain-text representation, so that "words" aren't required to use styling to be understood. That's the driving case behind encoding these forms. Ultimately the realization was that a universal character encoding could not be "one size fits all" when it comes to serving wildly diverging styles of usage.
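
To see what that plain-text robustness looks like in practice, here is a minimal sketch in Python (the transcription "tʰʷ" is just an illustration I picked, not from the original discussion). The superscript letters are ordinary encoded characters, so the "word" needs no styling at all:

    import unicodedata

    # an aspirated, labialized "t": a base letter plus two modifier letters
    word = "t\u02B0\u02B7"
    for ch in word:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0074  LATIN SMALL LETTER T
    # U+02B0  MODIFIER LETTER SMALL H
    # U+02B7  MODIFIER LETTER SMALL W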

Another example of this dichotomy involves the distinction between mathematics and text. Ordinary plain text does not carry font information, and it is fully acceptable to render it in any font that supports the letters in question. That even goes for styles that aren't easily readable to everyday users. For example, text in the Latin script can be rendered in a Fraktur font that many people may have difficulty deciphering or reading fluently. No matter: you haven't changed the meaning of the text by doing that. And the selection of possible fonts is near infinite. Some font variations are generic enough that they can be applied to many scripts; others may be limited in practice to some specific alphabet.

In math notation, you have the situation that mathematicians have used the contrast between different font shapes to carry meaning. In some conventions, Fraktur shapes are used to indicate that a variable is a vector and not a scalar, for example. There are a handful of font styles that are used in this way, a fairly fixed set, and usually covering a limited set of characters as well. Because the operation is not fully generic, it is possible to cover it with explicitly encoded characters. At that point, there's the benefit of preserving that distinction in plain text.
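
As a concrete illustration (a small Python sketch of my own, not part of the original point): the styled forms used this way are separately encoded characters, so the meaning-carrying distinction survives in plain text:

    import unicodedata

    # plain, bold, Fraktur and double-struck "A" are four distinct characters
    for ch in ("A", "\U0001D400", "\U0001D504", "\U0001D538"):
        print(f"U+{ord(ch):05X}  {unicodedata.name(ch)}")
    # U+00041  LATIN CAPITAL LETTER A
    # U+1D400  MATHEMATICAL BOLD CAPITAL A
    # U+1D504  MATHEMATICAL FRAKTUR CAPITAL A
    # U+1D538  MATHEMATICAL DOUBLE-STRUCK CAPITAL A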

In fact, it is possible, this way, to render a very large subset of mathematical notation in an (almost) plain text form. Incidentally, that is not all that dissimilar from the concept of markdown: a plain text stream with a few chosen conventions, in the math case about the use of parentheses, plus dedicating some characters to function as subscript and superscript "operators". (All the other math operators, such as integrals or radical signs, trigger their own formatting, thus obviating the need to encode that explicitly.)
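
Something along those lines might look like this (a sketch of my own; the choice of "_" and "^" as the dedicated operator characters is just one possible convention, the one used by linear formats such as UnicodeMath):

    # styled variables are real characters; "_" and "^" act as the
    # subscript/superscript operators of the near-plain-text convention
    expr = "\U0001D44E_1 + \U0001D44F^2 = \U0001D509(\U0001D465)"
    print(expr)  # 𝑎_1 + 𝑏^2 = 𝔉(𝑥)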

Having the characters for all shape variants used for variables encoded directly makes this near-plain-text form very powerful. Again, what is a useful generic solution for ordinary text isn't as workable for a notational system, and vice versa. The emerging insight was that Unicode should strive to make reasonable accommodations for both, but in a way that focuses on the central needs and features of each.

If you look just at the encoding, though, you come away with a sense of apparent duplication and also seeming incompleteness: the additions for phonetic notations will never cover the generic use in math, while the few styled alphabets for math do nothing for general text use. The key is to recognize which notation or use case is supported by what, and then things make a whole lot more sense.

A./