Re: Canonicalizing text parts to UTF-8 before applying body rules

RW Wed, 30 May 2012 06:44:28 -0700

On Tue, 29 May 2012 15:58:21 -0400
David F. Skoll wrote:


> I'm thinking of making something (a plugin, maybe?) that canonicalizes
> text/* parts to UTF-8 and lets you write rules using Unicode regexes.
> Something like:

> According to the perlunicode man page:
> 
>    Regular Expressions
>        The regular expression compiler produces polymorphic opcodes.
>        That is, the pattern adapts to the data and automatically switches to
>        the Unicode character scheme when presented with data that is
>        internally encoded in UTF-8 -- or instead uses a traditional
>        byte scheme when presented with byte data.
> 
> so assuming we present it with proper UTF-8 data, the regexes should
> Just Work.

UTF-8 won't work; it would need to be UTF-32 to be compatible with
sa-compile. From the re2c man page:

-u     Generate a parser that supports Unicode chars (UTF-32). This
       means the generated code can deal with any valid Unicode
       character up to 0x10FFFF. When UTF-8 or UTF-16 needs to be
       supported you need to convert the incoming stream to UTF-32
       upon input yourself.
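
As a rough illustration of the conversion step that man page text asks
for, here is a minimal sketch in Python (chosen only for brevity). The
function name utf8_to_utf32_units is made up for this example; it is not
part of SpamAssassin, sa-compile, or re2c, and a real scanner would do
the equivalent in C before feeding the generated parser.

```python
import struct

def utf8_to_utf32_units(raw):
    """Decode UTF-8 bytes and return one 32-bit code unit per character.

    Hypothetical helper for illustration only.
    """
    # Raises UnicodeDecodeError if the input is not well-formed UTF-8.
    text = raw.decode("utf-8")
    # An explicit byte order ("-le") means no BOM is prepended.
    packed = text.encode("utf-32-le")
    # Every UTF-32 code unit is exactly 4 bytes wide.
    return list(struct.unpack("<%dI" % (len(packed) // 4), packed))

# "café €" is 6 characters but 9 bytes in UTF-8.
sample = "caf\u00e9 \u20ac".encode("utf-8")
units = utf8_to_utf32_units(sample)
print(len(sample), [hex(u) for u in units])
```

The point is that a scanner built with -u sees one fixed-width unit per
character, whereas the same text in UTF-8 is a variable number of bytes
per character, so the input has to be widened first.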
