Thanks for the detailed analysis of the uri_detail plugin bug. I appreciate you taking the time to investigate this so thoroughly.
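As a sanity check to go with the report, here is a minimal sketch that rebuilds the anchor-text bytes from the debug output and confirms that a plain byte-level regex finds the pattern bytes in them. It uses Python's `re` module rather than Perl, so it is only a cross-check and not a reproduction of what uri_detail does internally; the variable names are my own.

```python
import re

# Anchor-text bytes, copied from the "text:" field of the debug output.
anchor_text = bytes.fromhex(
    "e0b895 e0b988 e0b8ad e0b8ad e0b8b2 e0b8a2 e0b8b8"
    "e0b897 e0b8b1 e0b899 e0b897 e0b8b5"
)

# The same byte sequence the rule is trying to match (from the body rule
# __UNICODE_BODY in the quoted message).
pattern = re.compile(rb"\xE0\xB8\x97\xE0\xB8\xB1\xE0\xB8\x99\xE0\xB8\x97\xE0\xB8\xB5")

# Byte-level search: no decoding, no Unicode semantics involved.
match = pattern.search(anchor_text)
single_byte = re.search(rb"\xe0", anchor_text)

# The bytes are valid UTF-8; decoding is only done to count characters.
decoded = anchor_text.decode("utf-8")

print("pattern found:", match is not None)
print("byte span:", match.span() if match else None)
print("single \\xe0 found:", single_byte is not None)
print("bytes:", len(anchor_text), "characters:", len(decoded))
```

The pattern bytes are a genuine substring of the anchor text, so on the raw bytes the rule would be expected to hit. This does not say where the plugin goes wrong, only that the rule itself is reasonable.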
I'll open a bug report with the SpamAssassin project, including the details from your analysis and a sample spam email that demonstrates the problem. Thanks again for your help!

On Mon, Feb 3, 2025 at 6:15 AM John Hardin <jhar...@impsec.org> wrote:

> On Sun, 2 Feb 2025, Jimmy wrote:
>
> > dbg: uri: Not match:
> > text:\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}
> > not matches the
> > pattern:(?^aa:\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{97\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{B1\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{99\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{97\\}\\\\x\\{E0\\}\\\\x\\{B8\\}\\\\x\\{B5\\})
> > with operator:=~
>
> Okay, I finally had some time to sit down and poke at this rather than
> just give a quick off-the-cuff shot in the dark.
>
> I think that the uri_detail plugin is broken w/r/t matching explicit bytes
> in the anchor text using the \x00 notation, but it's a little beyond my
> Perl skills and familiarity with the code base to completely analyze. The
> problem may affect more than just uri_detail anchor text rules.
>
> The doubled backslashes are just logging behavior, they are not indicators
> of a problem. The "\x{E0}" is just how Perl is formatting the raw E0 byte
> for logging. The regex should not attempt to match *that* exactly.
>
> Non-hex escapes work properly; the existing rule __MXG_UNSUB_LINK01 which
> contains "\s" successfully matches:
>
> dbg: uri: text matched: 'visit here to opt out or write a letter to the address below' =~ /(?^aa:(?i)unsubscribe|opt[\\s-]out)/
>
> ...and ASCII hex escapes work too; here's "opt out" as explicit hex bytes:
>
> dbg: uri: text matched: 'visit here to opt out or write a letter to the address below' =~ /(?^aa:\\x6f\\x70\\x74 \\x6f\\x75\\x74)/
>
> ...but \x00 notation for non-ASCII data does not:
>
> uri_detail __URIDETAIL_TEXT_UNICODE text =~ /\xe0/
>
> uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running
> '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}'
> =~ /(?^aa:\\xe0)/
> (does not match)
>
> Explicit Unicode hex in regexes does work *outside* the uri_detail anchor
> text context:
>
> body __UNICODE_BODY /\xE0\xB8\x97\xE0\xB8\xB1\xE0\xB8\x99\xE0\xB8\x97\xE0\xB8\xB5/
>
> dbg: rules: ran body rule __UNICODE_BODY ======> got hit:
> "\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}"
>
> ...so you might want to just do a regular body rule for the anchor text
> until this gets fixed.
>
> Can anyone explain what the "(?^aa:" in the uri_detail regex means? I
> suspect that's extremely relevant but I couldn't find anything online that
> explains it. Maybe that specifies something related to character encoding
> that's breaking the interpretation of the regex as a raw Unicode hex
> string.
>
> That appears to be added by Mail::SpamAssassin::Util compile_regexp(),
> apparently due to https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6802
> per
> https://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Util.pm?r1=1864964&r2=1864963&pathrev=1864964&diff_format=h
>
> Maybe we need to be a little less eager to broadly apply /aa ?
>
> No, I took that out and it's still not hitting on the Unicode hex pattern:
>
> dbg: uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running
> '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}'
> =~ /(?^:\\xE0\\xB8\\x97)/
> (does not match)
>
> FWIW pasting the raw unicode character into the regex also does not work:
>
> dbg: uri: uri_detail __URIDETAIL_TEXT_UNICODE text: running
> '\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}'
> =~ /(?^aa:\x{E0}\x{B8}\x{95})/
> (does not match)
>
> You should probably open a bug with your rule and attach the spample.
>
> --
> John Hardin KA7OHZ http://www.impsec.org/~jhardin/
> jhar...@impsec.org pgpk -a jhar...@impsec.org
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
> -----------------------------------------------------------------------
> Rights can only ever be individual, which means that you cannot
> gain a right by joining a mob, no matter how shiny the issued
> badges are, or how many of your neighbors are part of it. -- Marko
> -----------------------------------------------------------------------
> 10 days until Abraham Lincoln's and Charles Darwin's 216th Birthdays