Please keep the discussion on-list so others may benefit or make suggestions, thanks.

On Tue, 25 Sep 2012, Alexandre Boyer wrote:

I totally agree. I think that the HTML parser could easily handle this.

To the best of my knowledge, there is no modifier (like :addr or :name for From header checks) that one can use to get trustworthy data in a uri check. Or am I wrong?

Nope. The problem is you want to compare two parts of one link, not just see if one part matches something specific, so the sub-parts modifiers model isn't useful.

I'm thinking something like this, using what you presented as an example:

Generated internal pseudo-header:
   X-Spam-URL: http://www.probono.fr/95280_pdf|http://www.youtube.com/watch?v=3VvOFqaHbL5&feature=g-vrec&feature=g-vrec

...basically, URL-part|displayed-text-part

(Suggestions for a more appropriate delimiter than "|" are solicited...)

Repeat the header for each URL found that has displayed text; only include those where the displayed text is not the same as the URL.
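
To make the idea concrete, here is a rough sketch in Python (not SpamAssassin code) of how such values could be produced from an already-decoded HTML body. The names LinkPairCollector and pseudo_headers are made up for illustration, and it uses Python's standard html.parser rather than anything in SA's HTML handling:

```python
from html.parser import HTMLParser


class LinkPairCollector(HTMLParser):
    """Collect (href, displayed text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pairs = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.pairs.append((self._href.strip(), text))
            self._href = None


def pseudo_headers(html, delimiter="|"):
    """Return hypothetical X-Spam-URL values: one per link whose
    displayed text is non-empty and differs from its href."""
    collector = LinkPairCollector()
    collector.feed(html)
    collector.close()
    return [
        f"{href}{delimiter}{text}"
        for href, text in collector.pairs
        if text and text != href
    ]
```

Note that this emits a value for every mismatching link, including ordinary "click here" links; it is the rules written against the header that decide what is suspicious.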

Then you could write a simple bounded rule like:

header  YT_LINK_SPOOF  X-Spam-URL =~ m,\|https?://[^/]*youtube\.com/watch,i
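
For anyone who wants to try the pattern outside SA, the following is a quick Python equivalent of that regex, run against values in the proposed URL|text layout. The helper name is_youtube_spoof is hypothetical, and the X-Spam-URL header itself does not exist yet:

```python
import re

# Python equivalent of the Perl pattern m,\|https?://[^/]*youtube\.com/watch,i
# The leading \| ties the match to the displayed-text half of the value.
YT_LINK_SPOOF = re.compile(r"\|https?://[^/]*youtube\.com/watch", re.IGNORECASE)


def is_youtube_spoof(value):
    """True if the displayed-text part of 'URL|text' looks like a YouTube watch URL."""
    return YT_LINK_SPOOF.search(value) is not None
```

Because the pattern only inspects the text half, it relies on the header being generated only for mismatching pairs. A real YouTube link whose displayed text is a slightly different YouTube URL would still hit, so the rule would need testing against ham before it got a score.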

As long as you're already capturing the data, it might be useful to generate _two_ pseudo-headers per URL; add X-Spam-URL-DomainOnly with URL-domain|displayed-text-domain only if the displayed text looks like a URL. For example:

    X-Spam-URL-DomainOnly:   www.probono.fr|www.youtube.com
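
A possible way to derive that value, again only as a Python sketch: the function name domain_only_value is invented here, and "looks like a URL" is assumed to mean "parses with a scheme and a host", which is stricter than what a real implementation might want (bare "www.youtube.com/..." text would be skipped).

```python
from urllib.parse import urlsplit


def domain_only_value(href, text, delimiter="|"):
    """Return 'href-host|text-host', or None if either side has no host.

    Assumption: displayed text counts as URL-like only when urlsplit()
    finds a hostname in it, i.e. it has the form scheme://host/...
    """
    try:
        href_host = urlsplit(href.strip()).hostname
        text_host = urlsplit(text.strip()).hostname
    except ValueError:
        # Malformed input (for example an unterminated IPv6 literal).
        return None
    if not href_host or not text_host:
        return None
    return f"{href_host}{delimiter}{text_host}"
```

With hosts on both sides of the delimiter, a rule can compare domains directly instead of guessing where the path starts.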

Would it be a big job to implement this in SA PMS or the HTMLParser plugin? I'm
not a guru, but I can certainly get the job done if I have someone to read
it and suggest corrections.

Dunno, I haven't looked at that part of the code.

How could one submit patches to the dev team?

If you have something that is tested and works, open a bug in the Apache Bugzilla for SpamAssassin describing what the patch does, mark it as an enhancement, and attach the patch. You might need to provide a copyright assignment (or sign a CLA) for it to be included.

On Tue, 25 Sep 2012, Alexandre Boyer wrote:

 It's essentially FREEMAIL_FROM and the body only contains a fake Youtube
link like:

   <html><a
   href=3D"http://www.probono.fr/95280_pdf">http://www.youtube.com/wa=
   tch?v=3D3VvOFqaHbL5&feature=3Dg-vrec&feature=3Dg-vrec</a></B><BR></html>


This topic comes up regularly enough that it should be a FAQ.

As a general rule, checking for URL vs. visible text mismatch is not safe;
this is done legitimately quite a lot. The S/O would be very low.

However, for specific displayed-URL domains it _might_ be productive, as
you're suggesting.

 I ended with a regex for this kind of thing:

   full       AJB_UTUBE_BADLINK    m'\shref=.{0,3}(https?://)?(www\.)?(?!youtube)[^\.]+\.[^>]+>(https?://)?(www\.)?youtube\.'mi
   score      AJB_UTUBE_BADLINK    0 # 3.0


There are so many possible ways this could be obfuscated that a regex
approach would quickly get inhumanly complex and degenerate into a game of
whack-a-mole. Consider the href + display text in a quoted-unreadable HTML
body like your sample, where line breaks are inserted randomly from message
to message.

This would be much more productively handled within the HTML parser, where
the encoding and such is cleaned up and the URL and the cleaned-up display
text can be easily extracted.
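
As a small illustration of why the cleanup step matters, here is a Python sketch using the standard quopri module on a body shaped like the sample (indentation removed). The names RAW_QP_BODY and decode_qp are just for this example; it is not how SA decodes messages internally:

```python
import quopri

# Raw quoted-printable body, modelled on the sample in this thread.
# The soft line break ("=" at end of line) splits "watch" in two.
RAW_QP_BODY = (
    b'<html><a\n'
    b'href=3D"http://www.probono.fr/95280_pdf">http://www.youtube.com/wa=\n'
    b'tch?v=3D3VvOFqaHbL5&feature=3Dg-vrec&feature=3Dg-vrec</a></B><BR></html>\n'
)


def decode_qp(raw):
    """Undo quoted-printable: '=3D' becomes '=', and '=' + newline is dropped."""
    return quopri.decodestring(raw).decode("ascii")


decoded = decode_qp(RAW_QP_BODY)

# The literal "youtube.com/watch" is absent from the raw body but present
# once the encoding has been undone.
found_in_raw = b"youtube.com/watch" in RAW_QP_BODY
found_in_decoded = "youtube.com/watch" in decoded
```

A rule that works on the raw message has to anticipate a soft break at any position; once the body is decoded, the href and the display text are each contiguous again.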

Perhaps inserting pseudo-headers for URL + displaytext pairs would be a
good way to expose this information to rules.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  I'm seriously considering getting one of those bright-orange prison
  overalls and stencilling PASSENGER on the back. Along with the paper
  slippers, I ought to be able to walk right through security.
                                             -- Brian Kantor in a.s.r
-----------------------------------------------------------------------
 117 days since the first successful private support mission to ISS (SpaceX)
