On Tue, 29 May 2012 15:58:21 -0400 David F. Skoll wrote:
> I'm thinking of making something (a plugin, maybe?) that canonicalizes
> text/* parts to UTF-8 and lets you write rules using Unicode regexes.
> Something like:
> According to the perlunicode man page:
>
>   Regular Expressions
>     The regular expression compiler produces polymorphic opcodes.
>     That is, the pattern adapts to the data and automatically switches to
>     the Unicode character scheme when presented with data that is
>     internally encoded in UTF-8 -- or instead uses a traditional
>     byte scheme when presented with byte data.
>
> so assuming we present it with proper UTF-8 data, the regexes should
> Just Work.

UTF-8 won't work; it will need to be UTF-32 to be compatible with
sa-compile. From the re2c man page:

    -u  Generate a parser that supports Unicode chars (UTF-32). This
        means the generated code can deal with any valid Unicode
        character up to 0x10FFFF. When UTF-8 or UTF-16 needs to be
        supported you need to convert the incoming stream to UTF-32
        upon input yourself.
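
To illustrate the "convert the incoming stream to UTF-32 upon input yourself" step, here is a rough sketch in Python. It is not SpamAssassin or sa-compile code; the function names are made up for the example, and the choice of little-endian output without a byte-order mark is an assumption about what a consumer might expect.

```python
from typing import List


def utf8_to_code_points(data: bytes) -> List[int]:
    """Decode UTF-8 bytes into one integer per Unicode code point.

    Hypothetical helper: each returned integer is what a UTF-32 code
    unit would hold. Invalid UTF-8 raises UnicodeDecodeError.
    """
    return [ord(ch) for ch in data.decode("utf-8")]


def utf8_to_utf32_le(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-32, little-endian, with no BOM.

    Assumption: the consumer wants fixed-width 4-byte code units in
    little-endian order; use "utf-32-be" for big-endian instead.
    """
    return data.decode("utf-8").encode("utf-32-le")


if __name__ == "__main__":
    sample = "caf\u00e9".encode("utf-8")   # 5 bytes in UTF-8
    print(utf8_to_code_points(sample))      # one integer per character
    print(len(utf8_to_utf32_le(sample)))    # 4 bytes per character
```

Every character becomes exactly four bytes after the conversion, so a multi-byte UTF-8 sequence is a single unit by the time a matcher sees it.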