Richard,

the situation with the raised decimal point is a mess in Unicode.

I know that Mark thinks we have too many dots, but the reason this case is a mess is that the unification with U+002E is both unworkable in practice and counter to precedent.

The precedent in Unicode is to encode characters separately when they differ in appearance, unless it is fundamentally the "same" character and the difference in appearance can be determined unambiguously from "context".

There are two primary kinds of context that Unicode admits here. One is the surrounding text (as with positional forms of Arabic letters). The other is the overall stylistic context, such as a font choice (for example, upright vs. slanted integral symbols).

When the appearance of a character differs based on the author's intent, and two (or more) different appearances can occur in the same document with different significance, then the usual response by Unicode has been to encode explicit characters. (The phonetic characters are full of examples of this, like the lower case a without hook or the g with hook, both of which need to be distinguishable from other forms of these letters in phonetics.)

So, if a British document can use both inline dots and raised dots, then you can't assign a single font to cover both. Well, the thought was, software might recognize the numeric context. However, as you've pointed out, section numbers are numeric and do not have the raised dot. In fact, as far as such documents are concerned, the raised dot itself can be used by the reader to distinguish decimal numbers from other uses of numbers separated by dots (something not possible in other languages that lack this convention).
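To make the section-number problem concrete, here is a minimal sketch in Python. It assumes a hypothetical rendering rule that raises every U+002E found between two digits; the function name and the use of U+00B7 to stand in for the raised glyph are illustrative assumptions, not anything an actual renderer or the standard specifies.

```python
import re

# Hypothetical rule: any U+002E between two digits is taken to be a
# decimal point and shown raised (simulated here by substituting U+00B7).
DIGIT_DOT_DIGIT = re.compile(r"(?<=\d)\.(?=\d)")

def raise_dots_in_numeric_context(text: str) -> str:
    """Apply the purely contextual rule; it has no notion of author intent."""
    return DIGIT_DOT_DIGIT.sub("\u00b7", text)

sample = "See section 3.2 for the value 3.14."
result = raise_dots_in_numeric_context(sample)
print(result)
```

The rule raises the dot in the section number "3.2" just as it does in the decimal "3.14", because the surrounding text alone does not carry the distinction the author intended.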

So, on the face of it, the choice to unify the raised decimal dot with 002E violates the encoding model, by pushing semantic distinctions into some kind of markup. On top of that, it's not really practical to expect authors to mark up either all decimal numbers or all section numbers with separate styles or font bindings. That's something not required anywhere else.

So far, that's bad enough.

Next, you have the issue that Unicode refused (quite properly) to encode a generic "decimal separator" character, the appearance of which was supposed to vary with external context (like locale or a document-global style). This suggestion had been intended to allow numerical expressions to be cut and pasted between documents in different languages with all numbers formatted correctly w/o further editing. That is, the same character would appear as either comma or period (or raised period) depending on context.

I wrote that I agreed with the choice not to encode such a special character for that purpose. However, by not encoding a character for the raised decimal point, Unicode did an about-face and made 002E a "limited purpose" version of a "decimal separator". Suddenly, there is a character that is supposed to have a different appearance based on context: on the line for US documents, off the line for British documents.

This directly violates the precedent established by the refusal to encode the generic "decimal separator".

What can be done?

I believe the Unicode Standard should be fixed by explicitly removing all suggestions in the text that the raised decimal point is unified with 002E.

Second, the standard should be amended by identifying which character is to be used instead for this purpose.

It might be something like 00B7. In that case, 00B7 would have to have properties that effectively produce the correct result in numeric context, while leaving non-numeric context unchanged. I believe that is entirely possible, and non-disruptive, insofar as numeric use of 00B7 does not exist for any purpose other than showing a raised decimal point (I suspect there are documents in the wild that already use this character for that purpose).
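As a rough illustration of what "numeric context" could mean for 00B7, here is a minimal Python sketch. It assumes, purely for illustration, that U+00B7 between digits is read as a decimal point and that every other use (for instance the Catalan "col·legi") is left alone; the pattern and the helper name are hypothetical and do not describe any existing property or API.

```python
import re

MIDDLE_DOT = "\u00b7"  # U+00B7 MIDDLE DOT

# Assumption for this sketch: digits, U+00B7, digits form a raised-dot decimal.
RAISED_DECIMAL = re.compile(r"\d+\u00b7\d+")

def decimals_in(text: str) -> list[float]:
    """Return the values of all raised-dot decimals found in the text."""
    return [
        float(match.group().replace(MIDDLE_DOT, "."))
        for match in RAISED_DECIMAL.finditer(text)
    ]

text = "Section 3.2 gives 3\u00b714 for the col\u00b7legi."
values = decimals_in(text)
print(values)
```

Because the author chose the character explicitly, the section number "3.2" and the non-numeric middle dot are both untouched, and only "3·14" is treated as a decimal.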

If that alternative is deemed not acceptable, the only remaining choice would be to add a new character. (I would recommend that only as the last resort).

A./

