Re: HTML link regex

John Hardin Tue, 25 Sep 2012 22:13:32 -0700

Please keep the discussion on-list so others may benefit or make suggestions, thanks.


On Tue, 25 Sep 2012, Alexandre Boyer wrote:

I totally agree. I think that the HTML parser could easily handle this.
To the best of my knowledge, there is no modifier (like :addr or :name for From header checks) that one can use to have trustworthy data on a uri check. Or am I wrong?

Nope. The problem is you want to compare two parts of one link, not just see if one part matches something specific, so the sub-parts modifiers model isn't useful.


I'm thinking something like this, using what you presented as an example:

Generated internal pseudo-header:
   X-Spam-URL: http://www.probono.fr/95280_pdf|http://www.youtube.com/watch?v=3VvOFqaHbL5&feature=g-vrec&feature=g-vrec

...basically, URL-part|displayed-text-part

(Suggestions for a more appropriate delimiter than "|" are solicited...)

Repeat the header for each URL found that has displayed text; only include those where the displayed text is not the same as the URL.
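
To make the idea concrete, here is a minimal sketch in Python rather than in SpamAssassin's own Perl. It is not SpamAssassin code: the class LinkPairCollector and the function pseudo_headers are hypothetical names, and it assumes the HTML has already been transfer-decoded. It only shows the pairing of each href with its displayed text and the "skip if identical" filter described above.

```python
from html.parser import HTMLParser


class LinkPairCollector(HTMLParser):
    """Collect (href, displayed text) pairs for every <a> element.

    Hypothetical helper for illustration; not part of SpamAssassin.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pairs = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize runs of whitespace to single spaces.
            text = " ".join("".join(self._text).split())
            self.pairs.append((self._href.strip(), text))
            self._href = None


def pseudo_headers(html, delimiter="|"):
    """Return one X-Spam-URL line per link whose text differs from its URL."""
    collector = LinkPairCollector()
    collector.feed(html)
    collector.close()
    return [
        f"X-Spam-URL: {href}{delimiter}{text}"
        for href, text in collector.pairs
        if text and text != href
    ]
```

Feeding it the decoded form of the sample link should produce a single header like the one shown above; a link whose text equals its href produces nothing.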


Then you could write a simple bounded rule like:

header  YT_LINK_SPOOF  X-Spam-URL  m,\|https?://[^/]*youtube\.com/watch,i
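
As a quick sanity check, the same pattern can be tried outside SpamAssassin. The sketch below is a Python rendering of the regex in the rule above; the two sample header values are made up for illustration, and X-Spam-URL itself is still only a proposal.

```python
import re

# Python rendering of the pattern from the YT_LINK_SPOOF rule above.
YT_LINK_SPOOF = re.compile(r"\|https?://[^/]*youtube\.com/watch", re.IGNORECASE)

# Hypothetical header values in the proposed URL|displayed-text layout.
spoofed = "http://www.probono.fr/95280_pdf|http://www.youtube.com/watch?v=3VvOFqaHbL5"
unrelated = "http://www.youtube.com/watch?v=3VvOFqaHbL5|Watch the video"

print(bool(YT_LINK_SPOOF.search(spoofed)))    # True
print(bool(YT_LINK_SPOOF.search(unrelated)))  # False
```

Note that the pattern only inspects the displayed-text half, so it would also fire on a link whose href is itself on youtube.com but differs from the displayed YouTube URL.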

As long as you're already capturing the data, it might be useful to generate _two_ pseudo-headers per URL; add X-Spam-URL-DomainOnly with URL-domain|displayed-text-domain only if the displayed text looks like a URL. For example:


    X-Spam-URL-DomainOnly:   www.probono.fr|www.youtube.com
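
A rough sketch of the domain-only variant, again in Python and not SpamAssassin code. The function name domain_only_header is hypothetical, and "looks like a URL" is assumed here to mean "parses with an http or https scheme and a host"; displayed text such as a bare www.example.com would need a looser test.

```python
from urllib.parse import urlsplit


def domain_only_header(href, text, delimiter="|"):
    """Return an X-Spam-URL-DomainOnly line, or None if the text is not URL-like."""
    try:
        href_host = urlsplit(href.strip()).hostname
        text_parts = urlsplit(text.strip())
        text_host = text_parts.hostname
    except ValueError:
        # urlsplit rejects some malformed inputs (e.g. broken IPv6 brackets).
        return None
    # Assumption: only http(s) text with a host counts as "looks like a URL".
    if text_parts.scheme not in ("http", "https") or not text_host:
        return None
    if not href_host:
        return None
    return f"X-Spam-URL-DomainOnly: {href_host}{delimiter}{text_host}"


print(domain_only_header(
    "http://www.probono.fr/95280_pdf",
    "http://www.youtube.com/watch?v=3VvOFqaHbL5",
))
print(domain_only_header("http://example.com/", "click here"))
```

The first call should print the example header above; the second prints None because "click here" is not URL-like.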

Would it be a big job to implement this in SA PMS or HTMLParser plugin? I'm
not a guru, but I can certainly have the job done if I have someone to read
it and suggest correction.


Dunno, I haven't looked at that part of the code.

How could one submit patches to the dev team?

If you have something that is tested and works, open a bug in the Apache Bugzilla for SpamAssassin describing what the patch does, mark it as an enhancement, and attach the patch. You might need to provide a copyright assignment (or sign a CLA) for it to be included.

On Tue, 25 Sep 2012, Alexandre Boyer wrote:

 It's essentially FREEMAIL_FROM and the body only contains a fake Youtube
 link like:

   <html><a
   href=3D"http://www.probono.fr/95280_pdf">http://www.youtube.com/wa=
   tch?v=3D3VvOFqaHbL5&feature=3Dg-vrec&feature=3Dg-vrec</a></B><BR></html>


This topic comes up regularly enough that it should be a FAQ.

As a general rule, checking for URL vs. visible text mismatch is not safe;
this is done legitimately quite a lot. The S/O would be very low.

However, for specific displayed-URL domains it _might_ be productive, as
you're suggesting.

 I ended with a regex for this kind of thing:

   full       AJB_UTUBE_BADLINK    m'\shref=.{0,3}(https?://)?(www\.)?(?!youtube)[^\.]+\.[^>]+>(https?://)?(www\.)?youtube\.'mi
   score      AJB_UTUBE_BADLINK    0 # 3.0


There are so many possible ways this could be obfuscated that a regex
approach would quickly get inhumanly complex and degenerate into a game of
whack-a-mole. Consider the href + display text in a quoted-unreadable HTML
body like your sample where line breaks are inserted randomly from message
to message.
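
A small Python illustration of that point, using a shortened version of the sample. The soft line break ("=" at end of line) splits "watch" in the raw body, so a pattern applied to the raw text misses it, while the same pattern matches once the quoted-printable encoding is undone. The variable names are illustrative only.

```python
import quopri
import re

# Shortened quoted-printable body modeled on the sample in this thread.
raw = (
    b'<a href=3D"http://www.probono.fr/95280_pdf">http://www.youtube.com/wa=\n'
    b'tch?v=3D3VvOFqaHbL5</a>'
)

pattern = re.compile(rb"youtube\.com/watch", re.IGNORECASE)

# Undo the transfer encoding: "=3D" becomes "=", soft line breaks vanish.
decoded = quopri.decodestring(raw)

print(pattern.search(raw) is not None)      # False: "wa=\ntch" hides "watch"
print(pattern.search(decoded) is not None)  # True
```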

This would be much more productively handled within the HTML parser, where
the encoding and such is cleaned up and the URL and the cleaned-up display
text can be easily extracted.

Perhaps inserting pseudo-headers for URL + displaytext pairs would be a
good way to expose this information to rules.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  I'm seriously considering getting one of those bright-orange prison
  overalls and stencilling PASSENGER on the back. Along with the paper
  slippers, I ought to be able to walk right through security.
                                             -- Brian Kantor in a.s.r
-----------------------------------------------------------------------
 117 days since the first successful private support mission to ISS (SpaceX)
