Gr33t1n6s @nd $alu+at:ons!
I'm trying to match on "viagra" obfu'ed like this: "Via&#289;ra". SA sends
the HTML through HTML::Entities::decode_entities, which (on my system at
least) translates &#289; to a Unicode character (U+0121, ġ). It gets
written out as the 2-byte sequence \xC4\xA1.
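For reference, here is a rough sketch of that decode-then-encode step. It is written in Python purely as an illustration: html.unescape stands in for decode_entities, and the assumption that the bytes SA sees are plain UTF-8 comes from the sequence quoted above, not from anything in SA's code.

```python
import html

# "&#289;" is the numeric entity from the obfuscated word.
# html.unescape is only a stand-in for Perl's decode_entities.
entity = "&#289;"
char = html.unescape(entity)

# Encode the decoded character as UTF-8 to see the bytes a byte-oriented
# regex would have to match.
utf8_bytes = char.encode("utf-8")

print(hex(ord(char)))  # 0x121
print(utf8_bytes)      # b'\xc4\xa1'
```

This agrees with the \xC4\xA1 sequence above: one code point (U+0121), two bytes.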
My question is, how do I match the entity in a reasonable manner?
None of the following match:
body /\x{121}/
body /\x{0121}/
body /\x{C4A1}/
body /\&\#289;/
This does match:
body /\xC4\xA1/
If I can, I would like to match on a range of HTML entities (for instance,
the obfu'ed G above is within a range of G-like entities: [&#284;-&#291;]).
After I realized I wasn't matching using literal Unicode characters, I
thought of a workaround for matching these entity ranges. My assumption
was that adjacent HTML entities should translate to adjacent Unicode
characters, and thus their 2-byte representations would also be adjacent
(e.g. \xC4[\xA0-\xAF]), but that assumption turns out to be wrong.
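One way to inspect the byte sequences for a run of entities is sketched below. It is Python again, only for illustration; the helper name utf8_escapes is made up, and the 284-291 range is the G-like range mentioned earlier. It assumes the decoded text is UTF-8 and says nothing about how SA applies the rule.

```python
def utf8_escapes(first, last):
    """Return (code point, \\xNN-escaped UTF-8 bytes) for each code point
    in the inclusive range [first, last]."""
    rows = []
    for cp in range(first, last + 1):
        encoded = chr(cp).encode("utf-8")
        rows.append((cp, "".join("\\x%02X" % b for b in encoded)))
    return rows

# Print the byte sequence behind each numeric entity in the G-like range.
for cp, escaped in utf8_escapes(284, 291):
    print("&#%d; -> %s" % (cp, escaped))
```

Under plain UTF-8 the second byte for this range runs from \x9C to \xA3, which is not the \xA0-\xAF span used in the example above. Whether a byte class built from this listing actually matches in a body rule is a separate question that this sketch does not test.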
Also, I don't want to use rawbody, since HTML comments mixed into the
text would then easily defeat the obfu-detection.
Finally, does anyone know if decode_entities returns the same Unicode
characters on every system, or is it dependent on your locale (or
something)? I'm running Perl 5.8.0, by the way.
Thanks!
--
Chris Thielen
Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/