Thanks for the detailed analysis of the uri_detail plugin bug.  I
appreciate you taking the time to investigate this so thoroughly.

I'll open a bug report with the SpamAssassin project, including the details
from your analysis and a sample spam email that demonstrates the problem.
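
For the report, here is a rough Python sketch of what I expect at the byte
level. It is not SpamAssassin code, and Python's re module is only a stand-in
for Perl's regex engine, so it shows the expected result rather than the
plugin's actual behavior. The names ANCHOR_TEXT, BYTE_PATTERN and
LOGGED_FORM_PATTERN are my own; the bytes are copied from the debug output
quoted below.

```python
import re

# Anchor-text bytes, copied from the "running" debug line in the quoted
# analysis (12 Thai characters, 3 UTF-8 bytes each).
ANCHOR_TEXT = bytes.fromhex(
    "e0b895 e0b988 e0b8ad e0b8ad e0b8b2 e0b8a2 e0b8b8"
    "e0b897 e0b8b1 e0b899 e0b897 e0b8b5"
)

# Hex escapes for the raw bytes, as in the __UNICODE_BODY rule.
BYTE_PATTERN = re.compile(
    rb"\xE0\xB8\x97\xE0\xB8\xB1\xE0\xB8\x99\xE0\xB8\x97\xE0\xB8\xB5"
)

# A pattern that matches the *logged* form, i.e. the literal characters
# backslash, x, {, E, 0, } and so on. This is what a regex must not target.
LOGGED_FORM_PATTERN = re.compile(rb"\\x\{E0\}\\x\{B8\}\\x\{97\}")

byte_match = BYTE_PATTERN.search(ANCHOR_TEXT)
logged_match = LOGGED_FORM_PATTERN.search(ANCHOR_TEXT)

print("byte-level pattern matched:", byte_match is not None)
print("logged-form pattern matched:", logged_match is not None)
```

The byte-level pattern finds the sequence inside the anchor text, while the
pattern written against the logged "\x{E0}" form does not, which is the
distinction between real bytes and their debug formatting described below.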

Thanks again for your help!


On Mon, Feb 3, 2025 at 6:15 AM John Hardin <jhar...@impsec.org> wrote:

> On Sun, 2 Feb 2025, Jimmy wrote:
>
> > dbg: uri: Not match:
> > text:\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}
> > not matches the
> > pattern:(?^aa:\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{97\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{B1\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{99\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{97\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{B5\\})
> > with operator:=~
>
> Okay, I finally had some time to sit down and poke at this rather than
> just give a quick off-the-cuff shot in the dark.
>
> I think that the uri_detail plugin is broken w/r/t matching explicit bytes
> in the anchor text using the \xNN hex notation, but it's a little beyond my
> Perl skills and familiarity with the code base to completely analyze. The
> problem may affect more than just uri_detail anchor text rules.
>
> The doubled backslashes are just logging behavior, they are not indicators
> of a problem. The "\x{E0}" is just how Perl is formatting the raw E0 byte
> for logging. The regex should not attempt to match *that* exactly.
>
> Non-hex escapes work properly; the existing rule __MXG_UNSUB_LINK01 which
> contains "\s" successfully matches:
>
> dbg: uri: text matched: 'visit here to opt out or write a letter to the
> address below' =~ /(?^aa:(?i)unsubscribe|opt[\\s-]out)/
>
> ...and ASCII hex escapes work too; here's "opt out" as explicit hex bytes:
>
> dbg: uri: text matched: 'visit here to opt out or write a letter to the
> address below' =~ /(?^aa:\\x6f\\x70\\x74 \\x6f\\x75\\x74)/
>
> ...but \xNN notation for non-ASCII data does not:
>
> uri_detail __URIDETAIL_TEXT_UNICODE text =~ /\xe0/
>
> uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running
> '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}'
>
> =~ /(?^aa:\\xe0)/
>
> (does not match)
>
>
> Explicit hex escapes for UTF-8 bytes in regexes do work *outside* the
> uri_detail anchor text context:
>
> body  __UNICODE_BODY  /\xE0\xB8\x97\xE0\xB8\xB1\xE0\xB8\x99\xE0\xB8\x97\xE0\xB8\xB5/
>
> dbg: rules: ran body rule __UNICODE_BODY ======> got hit:
> "\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}"
>
> ...so you might want to just do a regular body rule for the anchor text
> until this gets fixed.
>
>
> Can anyone explain what the "(?^aa:" in the uri_detail regex means? I
> suspect that's extremely relevant but I couldn't find anything online that
> explains it. Maybe that specifies something related to character encoding
> that's breaking the interpretation of the regex as a raw Unicode hex
> string.
>
> That appears to be added by Mail::SpamAssassin::Util compile_regexp(),
> apparently due to https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6802
> per
> https://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Util.pm?r1=1864964&r2=1864963&pathrev=1864964&diff_format=h
>
> Maybe we need to be a little less eager to broadly apply /aa ?
>
> No, I took that out and it's still not hitting on the Unicode hex pattern:
>
> dbg: uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running
> '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}'
>
> =~ /(?^:\\xE0\\xB8\\x97)/
>
> (does not match)
>
> FWIW pasting the raw Unicode character into the regex also does not work:
>
> dbg: uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running
> '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}'
>
> =~ /(?^aa:\x{E0}\x{B8}\x{95})/
>
> (does not match)
>
>
>
> You should probably open a bug with your rule and attach the spample.
>
>
>
>
> --
>   John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
>   jhar...@impsec.org                         pgpk -a jhar...@impsec.org
>   key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> -----------------------------------------------------------------------
>    Rights can only ever be individual, which means that you cannot
>    gain a right by joining a mob, no matter how shiny the issued
>    badges are, or how many of your neighbors are part of it.  -- Marko
> -----------------------------------------------------------------------
>   10 days until Abraham Lincoln's and Charles Darwin's 216th Birthdays
>
