On Wed, May 30, 2012 at 02:43:54PM +0100, RW wrote:
> On Tue, 29 May 2012 15:58:21 -0400
> David F. Skoll wrote:
>
> > > I'm thinking of making something (a plugin, maybe?) that canonicalizes
> > > text/* parts to UTF-8 and lets you write rules using Unicode regexes.
> > > Something like:
>
> > According to the perlunicode man page:
> >
> >    Regular Expressions
> >        The regular expression compiler produces polymorphic opcodes.
> >        That is, the pattern adapts to the data and automatically switches to
> >        the Unicode character scheme when presented with data that is
> >        internally encoded in UTF-8 -- or instead uses a traditional
> >        byte scheme when presented with byte data.
> >
> > so assuming we present it with proper UTF-8 data, the regexes should
> > Just Work.
>
> UTF-8 wont work, it will need to be UTF-32 to be compatible with
> sa-compile. From the re2c man page:
>
>    -u  Generate a parser that supports Unicode chars (UTF-32). This
>        means the generated code can deal with any valid Unicode
>        character up to 0x10FFFF. When UTF-8 or UTF-16 needs to
>        be supported you need to convert the incoming stream to
>        UTF-32 upon input yourself.

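To make the byte-versus-character distinction in the quoted text concrete, here is a minimal sketch. It is written in Python purely as a stand-in, not Perl and not SpamAssassin code, and the sample string and variable names are made up for illustration. It only shows that the same pattern behaves differently on decoded text and on raw UTF-8 bytes, and that UTF-32 spends a fixed four bytes per code point.

```python
import re

# Hypothetical sample text; any string with non-ASCII characters would do.
text = "naïve café"              # decoded string: one unit per code point
raw = text.encode("utf-8")       # the same text as raw UTF-8 bytes

# Character semantics: "." consumes one code point, so "caf." reaches the end.
char_match = re.search(r"caf.$", text)

# Byte semantics: "é" is two bytes in UTF-8, so a single "." stops one byte
# short of the end and the anchored pattern fails.
byte_match = re.search(rb"caf.$", raw)

# UTF-32 (little-endian here, no BOM): exactly four bytes per code point.
utf32 = text.encode("utf-32-le")

print(len(text), len(raw), len(utf32))
print(char_match is not None, byte_match is not None)
```

Whether Perl's own engine or re2c-generated code behaves the same way in SA is a separate question; the sketch only illustrates why the input encoding matters to the rules.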
Frankly, I believe there are so many dependencies in SA that all this is
impossible without modifying the whole engine to support Unicode. I don't
see the point of a standalone plugin: what good does it do for the current
SA body rules? Given the way the current eval-body-chunk-magic-tricks code
works, with all its dependencies, I don't even know if it's possible to
implement similar stuff as a "plugin".