On 25 Feb 2021, at 13:37, Rick Cooper wrote:

I was just working on some rules to catch the current crop of malformed URLs used to escape detection by solutions that extract URLs from emails and compare them to known bad URLs, and I am wondering whether SpamAssassin's extraction patterns take this into account.

For instance:

https:www.google.com/mail
https:\/www.google.com/mail
https:\\www.google.com/mail

Will all work at getting you to gmail because the technical spec doesn't
actually require \\ after the colon.

Of course not: an http: URI must NOT contain '\\' after the colon; it MUST contain '//' after the colon. See https://tools.ietf.org/html/rfc7230#section-2.7.1, which is the technical spec for the formal syntax of an http URI. OTOH, there are URI schemes which do not include '//' (e.g. mailto:), so any tool that is doing broad URI detection can't be too picky.

What flavors of garbage almost-URIs will work in a browser very much depends on the whims of browser developers, and whether those are 'clickable' in your preferred MUA is dependent on the gullibility of your MUA author.

SpamAssassin traditionally has assumed that there will always be some MUA and browser authors who lack any sense of caution or prudence, so SA is VERY loose with what it will consider as maybe being a hostname in something that could be a URI in some obscure or novel scheme.

Will spamassassin still extract and normalize the urls above?

Yes, it will see all 3 as the same canonicalized URI.
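
To illustrate what "the same canonicalized URI" means here, the following is a minimal Python sketch. It is not SpamAssassin's code (SA is written in Perl and its real parsing is far looser and more involved); the function name normalize_http_uri and the regular expression are hypothetical, chosen only to show how the three variants above can collapse to one form.

```python
import re

# Hypothetical helper, NOT SpamAssassin's implementation: match "http:" or
# "https:" followed by any run (possibly empty) of slashes and backslashes.
_SCHEME_SEP = re.compile(r'^(https?):[\\/]*', re.IGNORECASE)

def normalize_http_uri(raw: str) -> str:
    """Lowercase the scheme and force exactly '//' after the colon."""
    return _SCHEME_SEP.sub(
        lambda m: m.group(1).lower() + '://', raw.strip(), count=1
    )

# The three malformed variants from the question (raw strings keep the
# backslashes literal).
samples = [
    r'https:www.google.com/mail',
    r'https:\/www.google.com/mail',
    r'https:\\www.google.com/mail',
]

# A set keeps only one copy of identical results.
normalized = {normalize_http_uri(s) for s in samples}
print(normalized)  # {'https://www.google.com/mail'}
```

All three inputs end up as a single entry, which is the effect a URL blocklist lookup needs before comparing against known bad URLs.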

I was hoping
to avoid digging through the source to find out.

No need to dig through the source: you can see what URIs SpamAssassin detects (trimmed of the parts after the hostname) in a message by manually testing it with 'spamassassin -D uri'. Note that SA will only show one instance of otherwise identical URIs after trimming and canonicalization.

--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
