Re: Canonicalizing text parts to UTF-8 before applying body rules

Henrik K Wed, 30 May 2012 07:03:17 -0700

On Wed, May 30, 2012 at 02:43:54PM +0100, RW wrote:
> On Tue, 29 May 2012 15:58:21 -0400
> David F. Skoll wrote:
> 
> 
> > I'm thinking of making something (a plugin, maybe?) that canonicalizes
> > text/* parts to UTF-8 and lets you write rules using Unicode regexes.
> > Something like:
> 
> > According to the perlunicode man page:
> > 
> >    Regular Expressions
> >        The regular expression compiler produces polymorphic opcodes.
> >        That is, the pattern adapts to the data and automatically switches to
> >        the Unicode character scheme when presented with data that is
> >        internally encoded in UTF-8 -- or instead uses a traditional
> >        byte scheme when presented with byte data.
> > 
> > so assuming we present it with proper UTF-8 data, the regexes should
> > Just Work.
> 
> UTF-8 won't work; it will need to be UTF-32 to be compatible with
> sa-compile.  From the re2c man page:
> 
> -u     Generate a parser that supports Unicode chars (UTF-32). This
>        means the generated code can deal with any valid Unicode
>        character up to 0x10FFFF. When UTF-8 or UTF-16 needs to
>        be supported you need to convert the incoming stream to
>        UTF-32 upon input yourself.
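
For illustration only (this is not SpamAssassin code): a minimal Python
sketch of the conversion step being discussed, assuming the declared charset
of the text part is known.  The function names canonicalize, to_utf8 and
to_utf32_code_points are hypothetical.  It also shows why a byte-oriented
pattern and a character-oriented pattern behave differently on the same text.

```python
import re

def canonicalize(raw, charset="us-ascii"):
    """Decode a text part's raw bytes to a Unicode string (hypothetical helper)."""
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or mislabeled charset: Latin-1 maps every byte, so it never fails.
        return raw.decode("latin-1")

def to_utf8(text):
    """Canonical UTF-8 bytes for the decoded text."""
    return text.encode("utf-8")

def to_utf32_code_points(text):
    """One integer per character, i.e. what a UTF-32 matcher would consume."""
    return [ord(ch) for ch in text]

raw = b"Gr\xfc\xdfe"                      # "Gruesse" with u-umlaut and sharp s, ISO-8859-1
text = canonicalize(raw, "iso-8859-1")
utf8 = to_utf8(text)

# Character-oriented pattern: two characters between "Gr" and "e".
char_match = re.search(r"^Gr.{2}e$", text)
# Byte-oriented pattern on UTF-8: each of the two characters is now 2 bytes,
# so the same pattern no longer matches.
byte_match = re.search(rb"^Gr.{2}e$", utf8)
```

The point of the sketch is only that the rule author has to know which of
the two representations the matcher sees; it says nothing about how the SA
engine itself would need to change.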


Frankly, I believe there are so many dependencies in SA that all this is
impossible without modifying the whole engine to support Unicode.  I don't
see the point of a standalone plugin; what good would it do for the current
SA body rules?  Given the way the current eval-body-chunk-magic-tricks code
works, with all its dependencies, I don't even know if it's possible to
implement something similar as a "plugin".
