On Wed, 20 Mar 2013 20:49:32 -0600
Karl Williamson <[email protected]> wrote:

> Now back to processing general text.  Doing any serious analysis of
> text will require using regular expressions.  That means normalizing
> the input, as UTS 18 finally now says.
I think that change may be associated with the fact that what were
intended to be Unicode regular expressions are not in general regular
expressions - strings canonically equivalent to (ab)* are not
recognisable by finite state machines if a and b are indecomposable and
have distinct non-zero canonical combining classes!

> Whatever normalization you
> choose, singleton decompositions are taken.

> That means that ANO TELEIA becomes a MIDDLE DOT, and GREEK QUESTION
> MARK (U+037E) becomes a SEMICOLON (U+003B), among other things.  This
> really presents a rather untenable situation for a program.  You have
> to normalize, but if you do, you lose critical information.

For linguistic analysis, you need the normalisation appropriate to the
task.  This is a case where Unicode normalisation generally throws away
information (namely, how the author views the characters), whereas in
analysing Burmese you may want to ignore the order of non-interacting
medial signs even though they have canonical combining class 0.

I have found it useful to use a fake UnicodeData.txt to perform a
non-Unicode normalisation using routines that were intended for
performing Unicode normalisation.  Fake decompositions are routinely
added to the standard ones when generating the default collation
weights for the Unicode Collation Algorithm - but there the results
still comply with the principle of canonical equivalence.

However, distinguishing U+00B7 and U+0387 would fail spectacularly if
the text had been converted to form NFC before you received it.

> Further, the code chart glyphs for the ANO TELEIA and the MIDDLE DOT
> differ, see attachment.  If they are canonically equivalent, and one
> is a mandatory decomposition of the other, why do they have differing
> glyphs?

Because the codepoints are usually associated with different fonts?
For a more striking example, compare the code chart glyphs for U+2F831,
U+2F832 and U+2F833, which are all canonically equivalent to U+537F.

Richard.
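The singleton decompositions and the reordering of combining marks
mentioned above can be checked with a few lines of code.  What follows
is only a minimal sketch, assuming Python 3 and its standard
unicodedata module (whose Unicode data version depends on the Python
build).  The helper names nfc, nfd and code_points are made up for
illustration, and U+0323 and U+0301 are merely one possible choice for
the marks a and b in the (ab)* example.

```python
import unicodedata


def nfc(text):
    """Return text converted to Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def nfd(text):
    """Return text converted to Normalization Form D."""
    return unicodedata.normalize("NFD", text)


def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


if __name__ == "__main__":
    # Singleton decompositions are applied by every normalisation form,
    # so the original code point cannot be recovered afterwards.
    for ch in ("\u0387", "\u037e", "\U0002F831", "\U0002F832", "\U0002F833"):
        print(code_points(ch), "->", code_points(nfc(ch)))

    # Two indecomposable marks with distinct non-zero combining classes
    # (an illustrative choice, not taken from the thread).
    a, b = "\u0323", "\u0301"
    print(unicodedata.combining(a), unicodedata.combining(b))

    # (ab)^n is canonically equivalent to a^n b^n, so matching up to
    # canonical equivalence has to accept a language that needs counting.
    n = 3
    print(code_points(nfd((a + b) * n)))
    print(nfd((a + b) * n) == nfd(a * n + b * n))
```

After nfc() has been applied, U+0387 and U+00B7 are the same string, so
any later attempt to distinguish them has nothing left to work with.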

