I believe it is fairly natural to think of physical sequences of code units as representatives of equivalence classes corresponding to abstract strings. One of the most important such equivalences is canonical equivalence, though one might want to apply some tailoring, in which case one no longer has canonical equivalence proper.
Given this abstraction, it is natural to want to be able to reference substrings. To this end, one may define the substrings of an abstract string to be the equivalence classes containing a physical substring of some physical representative of the original string. (There is probably no need to restrict oneself even to vaguely contiguous substrings.)

For example, I might want to express U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX (canonical decomposition <U+0061, U+0302>) as a substring of a string canonically equivalent to U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW, whose NFD equivalent is <U+0061, U+0323, U+0302>. This corresponds to a decomposition into a Vietnamese vowel (U+00E2) plus a Vietnamese tone mark (U+0323 COMBINING DOT BELOW). Now, if the canonical decomposition of U+1EAD is held in UTF-8 as x = <61, CC, A3, CC, 82>, I might, adapting boundary-based notation, choose to specify the *abstract* substring <U+00E2> as x[0:1,3:5]. (This selects code units 0, 3 and 4.) If I specify that extracted substrings always contain entire scalar values, I might, confusingly, abbreviate this notation to x[0, 3]. However, if U+1EAD is held as a single scalar value in the UTF-8 string y = <E1, BA, AD>, I want a notation that says, 'Take the first and third components of the NFD equivalent of the scalar value held starting at offset 0', e.g. "y[0.1, 0.3]". Using my own notation is likely to cause confusion - are there any shared, workable schemes in use?

I'm putting together a demonstration regular expression engine that works on 'traces' (see http://en.wikipedia.org/wiki/Trace_monoid for the definition of a 'trace') rather than strings, but for it to be useful I see no reason to restrict it to searching text in NFD. I'm currently working on capturing subexpressions. My hope is that we will ultimately have regular expression engines that fully grok canonical equivalence.
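For anyone who wants to verify the worked example, it can be checked with Python's standard unicodedata module. (The x[0:1,3:5] substring notation itself is, of course, my own invention for this post, not an existing API.)

```python
import unicodedata

# NFD of U+1EAD is <U+0061, U+0323, U+0302>: the dot below (ccc 220)
# is reordered before the circumflex (ccc 230).
nfd = unicodedata.normalize("NFD", "\u1EAD")
assert nfd == "\u0061\u0323\u0302"

# In UTF-8 that NFD string is x = <61, CC, A3, CC, 82>.
x = nfd.encode("utf-8")
assert list(x) == [0x61, 0xCC, 0xA3, 0xCC, 0x82]

# x[0:1,3:5] selects code units 0, 3 and 4, i.e. <61, CC, 82>,
# which decodes to <U+0061, U+0302> - the NFD of U+00E2.
sub = bytes([x[0]]) + x[3:5]
assert sub.decode("utf-8") == unicodedata.normalize("NFD", "\u00E2")
```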
When RL2.1 in UTS #18 "Unicode Regular Expressions" last looked usable as a specification (Version 13, 2008), the requirement that "an implementation shall provide a mechanism for ensuring that all canonically equivalent literal characters match" was too weak for my needs as a user.

Concatenation of expressions is rather more complicated for traces than for strings, though still within the scope of mathematical regular expressions, where 'regular' means recognisable by a finite automaton. There are also issues with Unicode sets - does "K" match "\p{ASCII} & \p{block=Letterlike Symbols}"? (The simplest solution seems to be to exclude code points with singleton decompositions, such as U+212A KELVIN SIGN, from the set of scalar values in Unicode sets.)

As an aside, I'd have liked to add fully decomposed Unicode strings under canonical equivalence to the Wikipedia article as an example of traces, but I couldn't find a source.

Richard.

_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode
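P.S. The KELVIN SIGN point is easy to check with Python's unicodedata module; this sketch just confirms the singleton decomposition and the block membership that make the set question awkward.

```python
import unicodedata

# U+212A KELVIN SIGN has a singleton canonical decomposition to U+004B,
# so under canonical equivalence "K" and the Kelvin sign are the same
# abstract string.
assert unicodedata.decomposition("\u212A") == "004B"
assert unicodedata.normalize("NFD", "\u212A") == "K"

# U+004B is in ASCII but outside Letterlike Symbols (U+2100..U+214F);
# U+212A is in Letterlike Symbols but outside ASCII. Neither scalar
# value lies in the literal intersection, yet canonically "K" arguably
# matches it via U+212A.
assert ord("K") < 0x80 and not (0x2100 <= ord("K") <= 0x214F)
assert 0x2100 <= 0x212A <= 0x214F and 0x212A >= 0x80
```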

