On Tuesday, April 2, 2002, at 01:24, Nick Ing-Simmons wrote:
> Dan Kogai <[EMAIL PROTECTED]> writes:
>>>>
>>>> I don't like the <UNNNN+UMMMM> part it will make the parsing messier.
>>>>
>>>> The \xYY\xYY is of course what I meant ;-)
>>>
>>> Not that much.  It's just a regex after all.
>
> For _perl_ it is but if we are going to get IBM's ICU or others
> to back-port it then it is better to keep things clean.

Point well taken.

> So let us have yacc-like:
>
> from : codepoint
>      | from codepoint
>      ;
>
> codepoint : '<' 'U' hexdigits '>'
>           ;
>
> to : octet
>    | to octet
>    ;
>
> octet : '\\' 'x' hexdigits
>       ;

Your suggestion is \xAA\xAA\xBB\xBB \xCC\xCC for compound characters and leave <U3000> \xA1\xA1 for an ordinary single character.  Did I get this one correct?

But I still feel easy with a distinction between Unicode Character (codepoint != UTF8 octet) and octets.  And as for octets, which representation do you think is correct?  just UCS stacked or UTF-8?

Dan the Encode Maintainer
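
For illustration only, here is a minimal Python sketch of a parser for the yacc-like grammar quoted above: one or more `<U...>` codepoints on the "from" side, then one or more `\x..` octets on the "to" side. The function name `parse_mapping`, the whitespace separator between the two sides, and the restriction of each octet to exactly two hex digits are assumptions made for the example; the grammar itself only says `hexdigits`. This is not how Encode or ICU actually parse their mapping files.

```python
import re

# Hypothetical names; only the grammar shape comes from the quoted message.
CODEPOINT = re.compile(r"<U([0-9A-Fa-f]+)>")
# Assumption: an octet is exactly two hex digits.
OCTET = re.compile(r"\\x([0-9A-Fa-f]{2})")
# from: one or more codepoints; to: one or more octets; whitespace between them.
LINE = re.compile(
    r"^((?:<U[0-9A-Fa-f]+>)+)\s+((?:\\x[0-9A-Fa-f]{2})+)\s*$"
)


def parse_mapping(line):
    """Return (list of codepoints, bytes) for one mapping line."""
    m = LINE.match(line)
    if m is None:
        raise ValueError("not a valid mapping line: %r" % line)
    codepoints = [int(h, 16) for h in CODEPOINT.findall(m.group(1))]
    octets = bytes(int(h, 16) for h in OCTET.findall(m.group(2)))
    return codepoints, octets


# Ordinary single character, as in the message.
single = parse_mapping(r"<U3000> \xA1\xA1")

# Compound character; the codepoints and octets here are made-up placeholders.
compound = parse_mapping(r"<U0041><U0301> \xAA\xBB")
```

Keeping the codepoint side and the octet side syntactically distinct is what lets a single anchored regex (or an equally small yacc grammar) accept a line without lookahead tricks.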

