On 25 Feb 2021, at 13:37, Rick Cooper wrote:
I was just working on some rules to catch the current crop of malformed
urls used to escape detection by solutions that extract urls from emails
and compare them to known bad urls and I am wondering if spamassassin's
patterns for extraction take this into account?
For instance:
https:www.google.com/mail
https:\/www.google.com/mail
https:\\www.google.com/mail
Will all work at getting you to gmail because the technical spec doesn't
actually require \\ after the colon.
Of course not: An http: URI must NOT contain '\\' after the colon; it
MUST contain '//' after the colon. See
https://tools.ietf.org/html/rfc7230#section-2.7.1 which is the technical
spec for the formal syntax of an http URI. OTOH, there are URI schemes
which do not include '//' (e.g. mailto:) so any tool that is doing broad
URI detection can't be too picky.
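
To make the distinction concrete, here is a minimal Python sketch of what a
strictly spec-minded check might look like. It is not SpamAssassin code: the
helper name has_strict_http_authority and the regex are assumptions made up
for illustration, and the pattern only approximates the RFC grammar (it
checks for '//' plus the first character of a host, nothing more).

```python
import re

# Hypothetical helper, not part of SpamAssassin: a deliberately strict check
# that an http/https URI has '//' right after the colon, followed by at
# least one character that could start a host.
_STRICT_HTTP = re.compile(r'^https?://[^/\\\s]', re.IGNORECASE)


def has_strict_http_authority(uri: str) -> bool:
    """Return True only if the URI starts with http:// or https:// plus a host."""
    return _STRICT_HTTP.match(uri) is not None


# The well-formed URI followed by the three malformed variants from the question.
EXAMPLES = [
    "https://www.google.com/mail",
    "https:www.google.com/mail",
    r"https:\/www.google.com/mail",
    r"https:\\www.google.com/mail",
]

if __name__ == "__main__":
    for uri in EXAMPLES:
        print(uri, has_strict_http_authority(uri))
```

Only the first example passes. An extractor that insisted on this form would
skip the three malformed variants entirely, which is exactly the evasion
being asked about.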
What flavors of garbage almost-URIs will work in a browser very much
depends on the whims of browser developers, and whether those are
'clickable' in your preferred MUA is dependent on the gullibility of
your MUA author.
SpamAssassin traditionally has assumed that there will always be some
MUA and browser authors who lack any sense of caution or prudence, so SA
is VERY loose with what it will consider as maybe being a hostname in
something that could be a URI in some obscure or novel scheme.
Will spamassassin still extract and normalize the urls above?
Yes, it will see all 3 as the same canonicalized URI.
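
As a rough illustration of what "the same canonicalized URI" means here, the
Python sketch below collapses whatever mix of slashes and backslashes follows
"http:" or "https:" into a plain '//'. This is an assumption-laden toy, not
SpamAssassin's actual parsing logic: the function name
canonicalize_loose_http and the regex are hypothetical, and real
canonicalization covers far more cases than this one.

```python
import re

# Hypothetical pattern: an http/https scheme, the colon, then any run
# (possibly empty) of forward slashes and backslashes.
_LOOSE_HTTP = re.compile(r'^(https?):[\\/]*', re.IGNORECASE)


def canonicalize_loose_http(uri: str) -> str:
    """Rewrite a sloppy http(s) prefix to the lowercase 'scheme://' form."""
    return _LOOSE_HTTP.sub(lambda m: m.group(1).lower() + "://", uri, count=1)


# The three malformed variants from the question.
VARIANTS = [
    "https:www.google.com/mail",
    r"https:\/www.google.com/mail",
    r"https:\\www.google.com/mail",
]

if __name__ == "__main__":
    for uri in VARIANTS:
        print(canonicalize_loose_http(uri))
```

Under these assumptions all three variants come out as
https://www.google.com/mail, so a lookup against known bad URLs only has to
deal with one form.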
I was hoping to avoid digging through the source to find out.
No need to dig through the source: you can see which URIs SpamAssassin
detects in a message (trimmed of the parts after the hostname) by
manually testing it with 'spamassassin -D uri'. Note that SA will only
show one instance of otherwise identical URIs after trimming and
canonicalization.
--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire