Gr33t1n6s @nd $alu+at:ons!

I'm trying to match on Viagra obfuscated like this: "Via&#289;ra" (which renders as "Viaġra"). SA sends the HTML through HTML::Entities::decode_entities, which (on my system at least) translates &#289; to a Unicode character. It gets written out as the 2-byte sequence \xC4\xA1.
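
For reference, here is a minimal sketch of how entity 289 ends up as those two bytes. It is written in Python purely for illustration (SpamAssassin itself is Perl), and it assumes the decoded body is handed to the rule as UTF-8 bytes rather than as a character string, which is a guess and not something confirmed from the SA source. The helper name utf8_escape is hypothetical.

```python
import html
import re


def utf8_escape(code_point):
    """Return the UTF-8 bytes of a code point as a \\xNN regex fragment."""
    return "".join(f"\\x{b:02X}" for b in chr(code_point).encode("utf-8"))


# Decode the numeric entity the same way an entity decoder would.
char = html.unescape("&#289;")
code_point = ord(char)            # 289 decimal == 0x121
utf8 = char.encode("utf-8")       # the two bytes that get written out

# Assumption: the rule sees the body as raw UTF-8 bytes.
body = html.unescape("Via&#289;ra").encode("utf-8")
pattern = utf8_escape(code_point).encode("ascii")
matched = re.search(pattern, body) is not None

print(hex(code_point), utf8, utf8_escape(code_point), matched)
```

Under that assumption, a pattern written as the byte pair matches, while a single-character escape such as \x{121} has nothing to match against in a byte string.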

My question is, how do I match the entity in a reasonable manner? None of the following match:

  body /\x{121}/
  body /\x{0121}/
  body /\x{C4A1}/
  body /\&\#289;/

This does match:

  body /\xC4\xA1/

If I can, I would like to match on a range of HTML entities (for instance, the obfuscated G above falls within a range of G-like entities: [&#284;-&#291;], i.e. Ĝ through ģ).

After I realized I wasn't matching using literal Unicode characters, I thought of a workaround for matching these entity ranges. My assumption was that adjacent HTML entities should translate to adjacent Unicode characters, and thus their 2-byte representations would also be adjacent (e.g. \xC4[\xA0-\xAF]), but that assumption turns out to be wrong.

Also, I don't want to use rawbody, since mixing in HTML comments would then easily defeat the obfuscation detection.

Finally, does anyone know whether decode_entities returns the same Unicode characters on every system, or is it dependent on your locale (or something)?

I'm running Perl 5.8.0, by the way.

Thanks!

-- 
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/