Patches item #830290, was opened at 2003-10-25 18:30
Message generated for change (Comment added) made by montanaro
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=830290&group_id=61702

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request, not
just the latest update.

Category: None
Group: None
Status: Open
Resolution: Accepted
Priority: 5
Submitted By: Toby Dickenson (htrd)
>Assigned to: Tony Meyer (anadelonbrin)
Summary: url detection

Initial Comment:
I've been looking into a couple of unsures that generated surprisingly
few tokens. My mail reader detects some text as links because it
begins with "www.", but SpamBayes needed the "http://" prefix too, and
replacing that text with a skip token was a big loss.

In fixing that, I found that this re had always matched a little too
much: it would match URLs that start in the middle of words. It always
generated the tokens "xxx" and "url:www" for messages that contained
"xxxhttp://www". That wasn't so bad, but I guess we should avoid
generating the same tokens for messages that contain "xxxwww.". To
address this I have also changed the re to require that URLs start
after a non-alphanumeric character. Sadly, my end result is a much
messier re.
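As a rough sketch of the idea (hypothetical code, not the diff
attached to this item): accept bare "www." URLs as well as
scheme-prefixed ones, but only when the candidate starts the text or
follows a non-alphanumeric character, so that "xxxwww.example.com" no
longer produces url tokens:

    import re

    url_re = re.compile(r"""
        (?:^|(?<=[^0-9A-Za-z]))   # start of text, or a non-alphanumeric char
        (?: (?:https?|ftp)://     # an explicit scheme ...
          | (?=www\.) )           # ... or a bare "www." prefix
        [^\s<>'"]+                # the body of the URL (deliberately loose)
    """, re.VERBOSE | re.MULTILINE)

    for text in ("see www.example.com",
                 "xxxwww.example.com",
                 "visit http://spambayes.org now"):
        print(text, "->", url_re.findall(text))

Here the first and third examples match while "xxxwww.example.com"
yields nothing, which is what the new anchor buys.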
----------------------------------------------------------------------

>Comment By: Skip Montanaro (montanaro)
Date: 2005-05-22 21:52

Message:
Logged In: YES
user_id=44345

Handing this to Tony... Tony, should we strip the "x-" prefix from any
number of still-experimental options, or just close this and be on to
other things?

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-02-23 11:27

Message:
Logged In: YES
user_id=44345

I guess I should at least mark this request accepted. I'll leave it
open for the time being until we decide to remove the "x-"perimental
prefix from the option.

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2004-02-16 22:31

Message:
Logged In: YES
user_id=552329

Note that this change is in the source now, controlled by the
x-fancy_url_recognition option. (In CVS, or in 1.0a9.)

----------------------------------------------------------------------

Comment By: Toby Dickenson (htrd)
Date: 2004-01-08 05:35

Message:
Logged In: YES
user_id=46460

I created this after getting a run of spams whose body was only a
single URL, all of which were incorrectly classified. There were no
spam clue tokens in the header, and only the one skip token in the
body. This patch fixed the classification of those messages. I've not
run the full tests to prove that it doesn't have some other bad
effect.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-01-07 16:43

Message:
Logged In: YES
user_id=44345

Deferring construction of the URL re until inside URLStripper.__init__
didn't change anything. My conclusion is that this change has no
effect other than to slightly reduce the number of skip: tokens. Toby,
did it have a positive effect on SB performance for you?

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-01-07 16:33

Message:
Logged In: YES
user_id=44345

Tests are finished. I saw *no change* at all. Maybe the option hasn't
been properly initialized from the ini file at the time the tokenizer
module is loaded? I'll fiddle some more.
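To illustrate the load-order pitfall Skip suspects here (a minimal,
hypothetical sketch; the options dict stands in for SpamBayes' real
Options machinery): a regex compiled when the tokenizer module is
imported bakes in whatever the option was at import time, whereas
compiling it inside URLStripper.__init__ picks up whatever value the
ini file has set by the time a tokenizer is actually created:

    import re

    # Hypothetical stand-in for the Tokenizer options read from the
    # ini file; the real code goes through the Options module.
    options = {"x-fancy_url_recognition": False}

    class URLStripper:
        def __init__(self):
            # Compile here, not at module import time, so the current
            # option value (possibly set after import) takes effect.
            if options["x-fancy_url_recognition"]:
                pat = r"(?:^|(?<=[^0-9A-Za-z]))(?:(?:https?|ftp)://|(?=www\.))[^\s<>]+"
            else:
                pat = r"(?:https?|ftp)://[^\s<>]+"
            self.url_re = re.compile(pat)

    options["x-fancy_url_recognition"] = True   # e.g. set while the ini loads
    stripper = URLStripper()                    # the new setting is honored
    print(stripper.url_re.search("see www.example.com").group())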
----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-01-07 16:30

Message:
Logged In: YES
user_id=44345

*sigh* Stinkin' SF. It accepted the modified diff I uploaded, but lost
my comments. Oh well. After a little tweak to the x-pick_apart_url
code, it worked fine. The number of tokens containing 'skip:w' dropped
from 380 to 126 in my normal database. I'm making a 10x10 timcv run
right now. Details at 11.

A slightly fancier diff is attached (and I deleted the other two). It
gates the change behind a Tokenizer:x-fancy_url_recognition switch,
just to make it easier to test. I anticipate deleting that option
if/when the change is accepted.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-01-06 19:04

Message:
Logged In: YES
user_id=31435

It sounds like a good idea, but (of course) it will need testing: the
special-case tokenizing of URLs is the single biggest win the project
ever got, so we can't afford any chance of screwing it up. I *expect*
this change will be a net win, though. We can make the regexp prettier
after we know whether it's a good idea <wink>.

----------------------------------------------------------------------
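For testing, turning the experimental switch on would look roughly
like this in a bayescustomize.ini-style options file (the section and
option names come from the comments above; the file name and exact
value syntax are assumptions):

    [Tokenizer]
    x-fancy_url_recognition: True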
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=830290&group_id=61702

_______________________________________________
Spambayes-bugs mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-bugs