On Fri, 22 Mar 2013 18:49:24 -0700 Asmus Freytag <[email protected]> wrote:
> On 3/22/2013 6:17 PM, Richard Wordingham wrote:
> > On Fri, 22 Mar 2013 18:01:14 -0700
> > Asmus Freytag <[email protected]> wrote:
> >
> >>> On 03/21/2013 04:48 PM, Richard Wordingham wrote:
> >>>> However, distinguishing U+00B7 and U+0387 would fail
> >>>> spectacularly of the text had been converted to form NFC before
> >>>> you received it.
> >> That's a claim for which the evidence isn't yet solid and if it
> >> could be made solid would make that claim very interesting.
> > Distinguishing the character codes will fail trivially.

Exactly. That is the point I made.

> The question
> is whether analysis or processing of the text will "fail
> spectacularly". The latter is the true test of whether the
> unification is "broken".

I did not claim that such analysis should fail spectacularly.

The root of the problem is that there are at least four uses of mid
point which we can't yet say definitively are wrong:

1) Ano teleia;
2) Internal boundary in Catalan (actually, this is arguably wrong),
   Occitan and other languages;
3) Traditional British decimal point (not formally confirmed as
   acceptable, but common practice where technology has not suppressed
   this part of British culture); and
4) A phonetic symbol used for transliterating Tangut.

There are other uses, but usually they reflect an origin in an 8-bit
encoding.

The character properties of U+00B7 have been crafted to support the
first two, and I don't see any problem with further adjusting them to
support the first three. Trailing decimal points may be interpreted as
ano teleia, but semantically that's no worse than the handling of a
trailing full stop in a number. Extending the properties to cover all
four uses looks difficult, but then, the character properties of U+002E
FULL STOP can't fully support all its uses.

Usually one can tell the four uses apart, but not always. Greek and
Tangut don't mix well, and hard line breaks can obscure the differences
between uses (1) and (2).
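
To make the NFC point quoted at the top concrete, here is a minimal
sketch using Python's standard unicodedata module. The sample string
and the helper name mid_points are made up for illustration; the only
fact relied on is that U+0387 has a canonical singleton decomposition
to U+00B7, so any normalization form replaces it.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"  # U+00B7 MIDDLE DOT
ANO_TELEIA = "\u0387"  # U+0387 GREEK ANO TELEIA

def mid_points(text):
    """List the code points of every mid point character found in text."""
    return [f"U+{ord(c):04X}" for c in text if c in (MIDDLE_DOT, ANO_TELEIA)]

# Hypothetical sample: a Greek word ending in ano teleia (use 1),
# followed by a number with a raised decimal point (use 3).
sample = "λόγος" + ANO_TELEIA + " 3" + MIDDLE_DOT + "14"

before = mid_points(sample)
after = mid_points(unicodedata.normalize("NFC", sample))

print(before)  # the two uses still carry different code points
print(after)   # after NFC, only U+00B7 remains
```

Once the text has been through NFC, the code point alone no longer
says which use was intended, so any analysis has to rely on context or
on information supplied from outside the text.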

In most cases, it is known how a text uses U+00B7, but there might very
well not be an interface for conveying such information to analysis
software.

Richard.

