On 31/10/2018 at 11:21, Asmus Freytag via Unicode wrote:
>
> On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote:
>
> > You could use the various hacks
> > you've discussed, with modifier letters; but that is not "encoding",
> > that is "abusing Unicode to do markup". At least, that's the view I
> > take!
>
> +1

There seems to be widespread confusion about what plain text is, and what
Unicode is for. From a US-QWERTY point of view, the current mental
representation of plain text may be ASCII-only. UK-QWERTY (not extended) adds
vowels with acute. Unicode grants every language its own plain text
representation. If superscript acts as an abbreviation indicator in a given
language, then it is part of the plain text representation of that language.

So far, so good. The core problem is now to determine whether superscript is
mandatory and baseline is a fallback, or superscript is optional and decorative
and baseline is correct. That may be a matter of opinion, as has been suggested.
However, we now know of a list of languages where superscript is mandatory and
baseline is a fallback. Leaving English aside, these languages in their own
right need the UTC to grant them the use of preformatted superscript letters.

Back in the beginning, when early Unicode set up the Standard, superscript was
ruled out of plain text, except where there was strong lobbying, as when
Vietnamese precomposed letters were added. Phoneticians have a strong lobby, so
they got some ranges of preformatted letters. To make sure nobody dared use them
in running text elsewhere, all *new* superscript letters were given names on a
MODIFIER LETTER basis, while subscript letters got straightforward names
containing SUBSCRIPT. Additionally, strong caveats were published in TUS.

And the trick worked: most of the time, the superscript letters are now referred
to by the “modifier letter” label that Unicode has decked them out with.

That is why, today, any discussion whose outcome would allow some languages to
use their traditional abbreviation indicators, in an already encoded and
implemented form, is at risk of strong biases. Fortunately the front has begun
to move, as the CLDR TC has granted ordinal indicators to the French locale as
of v34.

Ordinal indicators are one category of abbreviation indicators. Consistently,
the ordinal indicators already in ISO/IEC 8859-1 and now in Unicode are also
used in titles like "Sª" and "Nª Sª", as found in the navigation pane of:
http://turismosomontano.es/en/que-ver-que-hacer/lugares-con-historia/monumentos/iglesia-de-la-asuncion-peralta-de-alcofea

I’m not quite sure whether some people would still argue that such a string is
not understood differently from "Na Sa".
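
To make the point concrete, here is a small Python sketch (using only the
standard unicodedata module; the example string is mine) listing the code
points behind that title. The ordinal indicator is a character in its own
right, not a styled "a":

    import unicodedata

    for ch in "Nª Sª":
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+004E  LATIN CAPITAL LETTER N
    # U+00AA  FEMININE ORDINAL INDICATOR
    # U+0020  SPACE
    # U+0053  LATIN CAPITAL LETTER S
    # U+00AA  FEMININE ORDINAL INDICATOR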

> In general, I have a certain sympathy for the position that there is no 
> universal
> answer for the dividing line between plain and styled text; there are some 
> texts
> where the conventional division of plain text and styling means that the plain
> text alone will become somewhat ambiguous.

That is why phonetics needs preformatted super- and subscripts, and so do
languages relying on superscript as an abbreviation indicator.

> We know that for mathematics, a different dividing line meant that it is 
> possible
> to create an (almost) plain text version of many (if not most) mathematical
> texts; the conventions of that field are widely shared -- supporting a case 
> for
> allowing a standard encoding to support it.

Referring to Murray Sargent’s UnicodeMath, a Nearly Plain Text Encoding of
Mathematics,
https://www.unicode.org/notes/tn28/
is always a good reference in this discussion. UnicodeMath uses the full range
of superscript digits, because the range is full. It does not use superscript
letters, because their range is not full. Hence if superscript digits had
stopped at the legacy range "¹²³", only measurement units like the metric
equivalents of sq ft and cb ft could be written with superscripts, and that is
already allowed according to TUS. I don’t know why superscript 1 was added to
ISO/IEC 8859-1, though. Anyway, since phonetics needs a full range of
superscript and subscript digits, these were added to Unicode, and therefore
are used in UnicodeMath.
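
As a side note, the completed range is what makes exponents expressible in
plain text at all. The following Python sketch is my own illustration, not part
of UnicodeMath; it simply maps ASCII digits onto the scattered superscript code
points (¹²³ from Latin-1, the rest from the Superscripts and Subscripts block):

    # Superscript digits live in two places: ¹²³ in Latin-1,
    # ⁰ and ⁴–⁹ in the Superscripts and Subscripts block.
    SUP = {"0": "\u2070", "1": "\u00B9", "2": "\u00B2", "3": "\u00B3",
           "4": "\u2074", "5": "\u2075", "6": "\u2076", "7": "\u2077",
           "8": "\u2078", "9": "\u2079"}

    def superscript(digits: str) -> str:
        """Return a plain-text superscript rendering of a digit string."""
        return "".join(SUP[d] for d in digits)

    print("m" + superscript("2"))     # m²  (square metre)
    print("m" + superscript("3"))     # m³  (cubic metre)
    print("10" + superscript("12"))   # 10¹²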

Likewise, phonetics needs a nearly full range of superscript letters, so these
were added to Unicode, and therefore are used in the digital representation of
natural languages.

> However, it stops short of 100% support for edge cases, as does the ordinary
> plain text when used for "normal" texts. I think, on balance, that is OK.

That is not clear as long as “ordinary plain text” is not defined for the
purposes of this discussion. Since I have superscript small letters on live
keys, with the superscript "ᵉ" even doubled on the same level as the digits
(which it is used to turn into ordinals, for most of them), my French keyboard
layout driver allows the OS to output ordinary plain text consisting of various
signs, including superscript small Latin letters.

Now, does Unicode make a distinction between “plain text” and “ordinary plain
text”? There are various ways to “clean up” the UCS: first removing
presentation forms, then historic letters, then mathematical symbols, then why
not emoji, and, somewhere in between, phonetic letters, superscripts among
them. The result would then be “ordinary plain text”, but to what purpose?
Possibly so that all documents must be written up using TeX. Following that
logic to its end would mean that precomposed letters should be removed too,
given that they are accurately represented using escape sequences like "e\'"
for "é".
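
For the record, Unicode itself already handles that equivalence without
removing anything: the precomposed "é" and the sequence "e" plus combining
acute are canonically equivalent, as this small Python check (standard library
only) shows:

    import unicodedata

    precomposed = "\u00E9"    # é  LATIN SMALL LETTER E WITH ACUTE
    combining   = "e\u0301"   # e + COMBINING ACUTE ACCENT

    print(precomposed == combining)                                # False: different code points
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True: canonically equivalent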

> If there were another important notational convention, widely shared, 
> reasonably consistent and so on, then I see no principled objection to 
> considering
> whether it should be supported (minus some edge cases) in its own form of
> plain text (with appropriate additional elements encoded).

I’m pleased to read that. Given that the use of superscript in French is
important, widely shared, and reasonably consistent, we need to know what else
it should be. Certainly: supported by the local keyboard layout. Hopefully it
will be, soon.

> The current case, transcribing a post-card to make the text searchable, for
> example, would fit the use case for ordinary plain text, with the warning 
> against
> simulated effects of markup.

Triggering such a warning would first require sorting out whether a given
representation is best encoded using plain text or using markup. If it is plain
text, then it is not simulating anything. The reverse is true: markup simulates
accurate plain text. Searchability is ensured by equivalence classes. Google
Search has the most comprehensive equivalence classes, indexing even all the
mathematical preformatted Latin letters as if they were plain ASCII.
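
As a rough sketch of what such an equivalence class looks like (an illustration
of the general idea using compatibility normalization, not a description of
Google’s actual pipeline), NFKD already folds ordinal indicators, superscript
letters, and mathematical preformatted letters back to ASCII:

    import unicodedata

    samples = ["Sª",         # ordinal indicator
               "2ᵉ",         # French ordinal with MODIFIER LETTER SMALL E
               "𝐔𝐧𝐢𝐜𝐨𝐝𝐞"]    # mathematical bold Latin letters

    for s in samples:
        print(f"{s!r} folds to {unicodedata.normalize('NFKD', s)!r}")
    # 'Sª' folds to 'Sa'
    # '2ᵉ' folds to '2e'
    # '𝐔𝐧𝐢𝐜𝐨𝐝𝐞' folds to 'Unicode'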

> All other uses are better served by markup, whether
> SGML / XML style to capture identified features, or final-form rich text like 
> PDF
> just preserving the appearance.

Agreed.

Best regards,

Marcel
