2015-12-16 19:16 GMT+01:00 Doug Ewell <[email protected]>:

> The ones you suggest are stateful; they affect the rendering of
> arbitrary amounts of subsequent data, in a way reminiscent of ECMA-48
> ("ANSI") attribute switching, or ISO 2022 character-set switching.
> Unicode tries hard to avoid encoding such things.
You can try as hard as you want, but there are cases where stateful encoding cannot be avoided if we want to avoid disunifications, and there are even characters that cannot work at all without stateful analysis. Nor is this solved by style markup when that "style" is in fact completely semantic. The situation must be considered with more care:

- For example, the superscript Latin letter o, aka "ordinal masculine", is not just a superscript: it is a notation that adds the semantics of an abbreviation for the final letters, and it is linked to the letters before it, the whole being semantically a single word. The superscript style does not create such an attachment; it creates a separate "word" inside it, which is why the character was disunified from the letter o.
- But it is not good practice to encode in Unicode things that are just styles without clear semantics (so encoding SUB/SUP is really a bad idea).
- Conversely, it is simply impossible to work with Egyptian hieroglyphs, as the default clusters are clearly insufficient to create ANY kind of plain text: you need extra markup to add the necessary semantics, not style, and this markup should be encodable as plain text, without external markup for the presentation, when this presentation is fully semantic and clear (e.g. the Egyptian "cartouche" for names of kings).
- A similar issue occurs with SignWriting and other scripts that ALWAYS require a complex (non-linear) layout, where basic clusters are clearly insufficient in ALL texts. This means that the characters that were encoded are almost **useless** in all plain-text documents: you need extra "format" characters to express some form of orthographic rule, independently of the style or of an external markup language.

I'm in favor of adding **semantic** format characters to Unicode, not stylistic-only format characters, as soon as there exists a well-known orthographic convention which would work independently of styling.
But for now the encoded format characters only work on clusters that are too small, and clusters are only linear; this is clearly not enough, even for instructing other kinds of text analysis (such as breakers). Renderers will then be adapted and extended to work with more complex clusters whose internal structure is built from simpler cluster parts. Other renderers using the legacy rules will not be able to do that, but will attempt to render some basic fallback (possibly with special visible glyphs for those controls).

One kind of semantic format character which is useful and encoded is the "invisible parentheses" for mathematics, which can be encoded for example after a radical sign: use them around a number to extend the radical over more than one digit. This makes a clear visual and semantic distinction between "sqrt(24)" and "sqrt(2)4" when you don't want to render any parentheses, or between "sqrt(2+sqrt(3))" and "sqrt(2)+sqrt(3)".
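
To make the radical example a bit more concrete, here is a rough sketch in Python using only the standard `unicodedata` module. It is an illustration under assumptions, not a statement of how any renderer behaves: I picked U+2062 INVISIBLE TIMES, one of the invisible mathematical operators with general category Cf, as the format character, and the helper names `visible` and `code_points` are made up for this example. It shows two plain-text strings that look the same once format characters are dropped, yet remain distinct sequences carrying different semantics.

```python
import unicodedata

SQUARE_ROOT = "\u221a"      # SQUARE ROOT
INVISIBLE_TIMES = "\u2062"  # INVISIBLE TIMES, a format character (category Cf)

# Assumed encodings for this illustration only:
# "sqrt(24)"  -> radical followed by the two digits
# "sqrt(2)4"  -> radical, 2, an invisible multiplication, then 4
root_of_24 = SQUARE_ROOT + "24"
root_of_2_times_4 = SQUARE_ROOT + "2" + INVISIBLE_TIMES + "4"


def visible(text):
    """Drop every character of general category Cf (format characters)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def code_points(text):
    """List the Unicode character names, to expose what is really encoded."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    # Same visible residue, different underlying plain text.
    print(visible(root_of_24) == visible(root_of_2_times_4))
    print(code_points(root_of_24))
    print(code_points(root_of_2_times_4))
```

A process that ignores format characters sees the two strings as identical, while one that honors them can keep "sqrt(24)" and "sqrt(2)4" apart without any external markup, which is the kind of semantic (not stylistic) distinction I am arguing for.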

