[SAtalk] Matching HTML entities/unicode characters

Chris Thielen Tue, 18 Nov 2003 11:27:01 -0800

Gr33t1n6s @nd $alu+at:ons!

I'm trying to match on Viagra obfuscated like this:  "Via&#289;ra".  SA sends
the HTML through HTML::Entities::decode_entities, which (on my system at
least) translates &#289; to a Unicode character.  It gets written out as
the 2-byte sequence \xC4\xA1.
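
For reference, \xC4\xA1 is what the UTF-8 encoding of code point 289 (U+0121) looks like. The sketch below works the two bytes out by hand; utf8_two_bytes is a hypothetical helper written for this illustration, and it assumes the output encoding really is UTF-8, which the post does not state outright.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin or HTML::Entities):
# hand-rolled UTF-8 encoding for code points U+0080..U+07FF,
# which always occupy exactly two bytes.
sub utf8_two_bytes {
    my ($cp) = @_;
    die "code point outside the two-byte range\n"
        if $cp < 0x80 || $cp > 0x7FF;
    my $lead  = 0xC0 | ($cp >> 6);      # 110xxxxx carries the high bits
    my $trail = 0x80 | ($cp & 0x3F);    # 10xxxxxx carries the low six bits
    return sprintf('\\x%02X\\x%02X', $lead, $trail);
}

print utf8_two_bytes(289), "\n";        # prints \xC4\xA1
```

The result agrees with the byte sequence reported above, which suggests (but does not prove) that the body text reaches the rules as UTF-8 bytes.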


My question is, how do I match the entity in a reasonable manner?

None of the following match:
body /\x{121}/
body /\x{0121}/
body /\x{C4A1}/
body /\&\#289;/

This does match:
body /\xC4\xA1/
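
One possible reading of these results is that the rule is applied to a string of raw bytes rather than to a string of decoded characters. The sketch below is plain Perl, not SpamAssassin itself, and assumes the body arrives as the UTF-8 bytes \xC4\xA1; it shows how the same patterns behave against a byte string and against the decoded character string. Note that \x{C4A1} names the single character U+C4A1, not the byte pair.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: this is the byte string the rule sees.
my $bytes = "\xC4\xA1";

# Patterns tried against the raw bytes.
my $byte_vs_bytes = ($bytes =~ /\xC4\xA1/) ? 1 : 0;
my $byte_vs_char  = ($bytes =~ /\x{121}/)  ? 1 : 0;

# The same text decoded into one Perl character, U+0121.
my $chars = decode('UTF-8', $bytes);
my $char_vs_char  = ($chars =~ /\x{121}/)  ? 1 : 0;
my $char_vs_bytes = ($chars =~ /\xC4\xA1/) ? 1 : 0;

printf "bytes =~ /\\xC4\\xA1/ : %d\n", $byte_vs_bytes;   # 1
printf "bytes =~ /\\x{121}/  : %d\n", $byte_vs_char;    # 0
printf "chars =~ /\\x{121}/  : %d\n", $char_vs_char;    # 1
printf "chars =~ /\\xC4\\xA1/ : %d\n", $char_vs_bytes;   # 0
```

The first two lines reproduce the behaviour reported here. The last two show that the situation would be reversed if the text were matched as decoded characters.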

If I can, I would like to match on a range of HTML entities (for instance,
the obfuscated G above is within a range of G-like entities: [&#284;-&#291;]).

After I realized I wasn't matching using literal Unicode characters, I
thought of a workaround for matching these entity ranges.  My assumption
was that adjacent HTML entities should translate to adjacent Unicode
characters and thus their 2-byte representations would also be adjacent
(e.g. \xC4[\xA0-\xAF]), but that assumption turns out to be wrong.
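
As a way to check the byte ranges, the sketch below computes the byte-level character class for a range of code points using the core Encode module. byte_class_for_range is a hypothetical helper for illustration only. It assumes UTF-8 output and handles only ranges whose two ends encode to two bytes with the same lead byte. It does not explain why the attempt described above failed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a byte-level regex fragment for a range of
# code points, assuming UTF-8 and a shared lead byte at both ends.
sub byte_class_for_range {
    my ($first, $last) = @_;
    die "range is reversed\n" if $first > $last;
    my @lo = unpack 'C*', encode('UTF-8', chr($first));
    my @hi = unpack 'C*', encode('UTF-8', chr($last));
    die "need two-byte sequences with a shared lead byte\n"
        unless @lo == 2 && @hi == 2 && $lo[0] == $hi[0];
    return sprintf '\\x%02X[\\x%02X-\\x%02X]', $lo[0], $lo[1], $hi[1];
}

# The G-like entities &#284; through &#291;
print byte_class_for_range(284, 291), "\n";   # prints \xC4[\x9C-\xA3]
```

Under these assumptions the G-like range comes out as \xC4[\x9C-\xA3], whereas the class \xC4[\xA0-\xAF] corresponds to U+0120 through U+012F (&#288; to &#303;). A range that crosses a lead-byte boundary would need more than one alternative, which this sketch deliberately refuses to produce.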

Also, I don't want to use rawbody, since interleaved HTML comments would then
easily defeat the obfuscation detection.



Finally, does anyone know if decode_entities returns the same Unicode
characters on every system, or is it dependent on your locale (or
something)?  I'm running Perl 5.8.0, by the way.

Thanks!

--
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/


