On Tuesday, April 2, 2002, at 01:24, Nick Ing-Simmons wrote:
> Dan Kogai <[EMAIL PROTECTED]> writes:
>>>>
>>>> I don't like the <UNNNN+UMMMM> part it will make the parsing messier.
>>>>
>>>> The \xYY\xYY is of course what I meant ;-)
>>>
>>> Not that much.  It's just a regex after all.
>
> For _perl_ it is but if we are going to get IBM's ICU or others
> to back-port it then it is better to keep things clean.

Point well taken.

> So let us have yacc-like:
>
> from : codepoint
>      | from codepoint
>      ;
>
> codepoint : '<' 'U' hexdigits '>'
>           ;
>
> to   : octet
>      | to octet
>      ;
>
> octet : '\\' 'x' hexdigits
>       ;
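
To check that the grammar above is as easy to parse as it looks, here is a minimal Python sketch of a parser for one "from to" line. The names (parse_mapping, _parse_run) are hypothetical, and it assumes things the grammar leaves open: exactly two hex digits per octet, no whitespace inside a field, and whitespace between the two fields.

```python
import re

# Tokens from the yacc-like grammar; the two-digit octet width is an assumption.
CODEPOINT = re.compile(r"<U([0-9A-Fa-f]+)>")
OCTET = re.compile(r"\\x([0-9A-Fa-f]{2})")


def _parse_run(pattern, text, what):
    """Match one or more adjacent tokens and return their integer values."""
    pos, values = 0, []
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None:
            raise ValueError(f"bad {what} at offset {pos} in {text!r}")
        values.append(int(m.group(1), 16))
        pos = m.end()
    if not values:
        raise ValueError(f"empty {what} field")
    return values


def parse_mapping(line):
    """Parse '<Uxxxx>... \\xYY...' into (list of code points, bytes)."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"expected two fields, got {len(fields)}")
    src, dst = fields
    return _parse_run(CODEPOINT, src, "codepoint"), bytes(_parse_run(OCTET, dst, "octet"))


if __name__ == "__main__":
    print(parse_mapping(r"<U3000> \xA1\xA1"))
```

A compound character would simply be several codepoints in the "from" field, e.g. parse_mapping(r"<U304B><U309A> \xA4\xF7"), where the octet values are made up for illustration.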

Your suggestion is

\xAA\xAA\xBB\xBB        \xCC\xCC

for compound characters and leave

<U3000> \xA1\xA1

for an ordinary single character.  Did I get this right?
But I would still feel more at ease keeping a distinction between a
Unicode character (codepoint != UTF-8 octets) and octets.  And as for the
octets, which representation do you think is correct?  Just UCS stacked,
or UTF-8?
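
To make the two candidate octet representations concrete, here is a small Python sketch that writes the same codepoint sequence both ways. It reads "UCS stacked" as big-endian UCS-2 units placed back to back, which is an assumption about what is meant, and the helper names are hypothetical.

```python
def escape_octets(data):
    """Render bytes in the \\xYY notation used in the mapping lines."""
    return "".join(f"\\x{b:02X}" for b in data)


def stacked_ucs2(codepoints):
    """Assumed meaning of 'UCS stacked': 2-byte big-endian units, BMP only."""
    return b"".join(cp.to_bytes(2, "big") for cp in codepoints)


def utf8(codepoints):
    """The same sequence encoded as UTF-8."""
    return "".join(map(chr, codepoints)).encode("utf-8")


if __name__ == "__main__":
    seq = [0x304B, 0x309A]  # a base character plus a combining mark
    print(escape_octets(stacked_ucs2(seq)))  # \x30\x4B\x30\x9A
    print(escape_octets(utf8(seq)))          # \xE3\x81\x8B\xE3\x82\x9A
```

Either form is unambiguous for the parser; the difference is only in which one a reader of the mapping file is expected to decode by eye.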

Dan the Encode Maintainer

