Patches item #830290, was opened at 2003-10-25 18:30
Message generated for change (Comment added) made by montanaro
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=830290&group_id=61702

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request, not
just the latest update.

Category: None
Group: None
Status: Open
Resolution: Accepted
Priority: 5
Submitted By: Toby Dickenson (htrd)
>Assigned to: Tony Meyer (anadelonbrin)
Summary: url detection

Initial Comment:
I've been looking into a couple of unsures that generated surprisingly
few tokens. My mail reader detects some text as links because it
begins with "www.", but SpamBayes needed the "http://" prefix too, and
replacing that text with a skip token was a big loss.

In fixing that, I found that this re had always matched a little too
much: it would match URLs that start in the middle of words. It always
generated the tokens "xxx" and "url:www" for messages that contained
"xxxhttp://www". That wasn't so bad, but I guess we should avoid
generating the same tokens for messages that contain "xxxwww.". To
address this I have also changed the re to require that URLs start
after a non-alphanumeric character. Sadly, my end result is a much
messier re.
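As a rough sketch of the idea (hypothetical code, not the diff
attached to this item): accept bare "www." URLs as well as
scheme-prefixed ones, but only when the candidate starts the text or
follows a non-alphanumeric character, so that "xxxwww.example.com" no
longer produces url tokens:

    import re

    url_re = re.compile(r"""
        (?:^|(?<=[^0-9A-Za-z]))   # start of text, or a non-alphanumeric char
        (?: (?:https?|ftp)://     # an explicit scheme ...
          | (?=www\.) )           # ... or a bare "www." prefix
        [^\s<>'"]+                # the body of the URL (deliberately loose)
    """, re.VERBOSE | re.MULTILINE)

    for text in ("see www.example.com",
                 "xxxwww.example.com",
                 "visit http://spambayes.org now"):
        print(text, "->", url_re.findall(text))

Here the first and third examples match while "xxxwww.example.com"
yields nothing, which is what the new anchor buys.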
----------------------------------------------------------------------

>Comment By: Skip Montanaro (montanaro)
Date: 2005-05-22 21:52

Message:
Logged In: YES
user_id=44345

Handing this to Tony... Tony, should we strip the "x-" prefix from any
number of still-experimental options, or just close this and be on to
other things?

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-02-23 11:27

Message:
Logged In: YES
user_id=44345

I guess I should at least mark this request accepted. I'll leave it
open for the time being until we decide to remove the "x-"perimental
prefix from the option.

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2004-02-16 22:31

Message:
Logged In: YES
user_id=552329

Note that this change is in the source now, controlled by the
x-fancy_url_recognition option. (In CVS, or in 1.0a9.)

----------------------------------------------------------------------

Comment By: Toby Dickenson (htrd)
Date: 2004-01-08 05:35

Message:
Logged In: YES
user_id=46460

I created this after getting a run of spams whose body was only a
single URL, all of which were incorrectly classified. There were no
spam clue tokens in the header, and only the one skip token in the
body. This patch fixed the classification of those messages. I've not
run the full tests to prove that it doesn't have some other bad
effect.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-01-07 16:43

Message:
Logged In: YES
user_id=44345

Deferring construction of the URL re until inside URLStripper.__init__
didn't change anything. My conclusion is that this change has no
effect other than to slightly reduce the number of skip: tokens. Toby,
did it have a positive effect on SB performance for you?

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-01-07 16:33

Message:
Logged In: YES
user_id=44345

Tests are finished. I saw *no change* at all. Maybe the option hasn't
been properly initialized from the ini file at the time the tokenizer
module is loaded? I'll fiddle some more.
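To illustrate the load-order pitfall Skip suspects here (a minimal,
hypothetical sketch; the options dict stands in for SpamBayes' real
Options machinery): a regex compiled when the tokenizer module is
imported bakes in whatever the option was at import time, whereas
compiling it inside URLStripper.__init__ picks up whatever value the
ini file has set by the time a tokenizer is actually created:

    import re

    # Hypothetical stand-in for the Tokenizer options read from the
    # ini file; the real code goes through the Options module.
    options = {"x-fancy_url_recognition": False}

    class URLStripper:
        def __init__(self):
            # Compile here, not at module import time, so the current
            # option value (possibly set after import) takes effect.
            if options["x-fancy_url_recognition"]:
                pat = r"(?:^|(?<=[^0-9A-Za-z]))(?:(?:https?|ftp)://|(?=www\.))[^\s<>]+"
            else:
                pat = r"(?:https?|ftp)://[^\s<>]+"
            self.url_re = re.compile(pat)

    options["x-fancy_url_recognition"] = True   # e.g. set while the ini loads
    stripper = URLStripper()                    # the new setting is honored
    print(stripper.url_re.search("see www.example.com").group())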
----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-01-07 16:30

Message:
Logged In: YES
user_id=44345

*sigh* Stinkin' SF. It accepted the modified diff I uploaded, but lost
my comments. Oh well. After a little tweak to the x-pick_apart_url
code, it worked fine. The number of tokens containing 'skip:w' dropped
from 380 to 126 in my normal database. I'm making a 10x10 timcv run
right now. Details at 11.

A slightly fancier diff is attached (and I deleted the other two). It
gates the change behind a Tokenizer:x-fancy_url_recognition switch,
just to make it easier to test. I anticipate deleting that option
if/when the change is accepted.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-01-06 19:04

Message:
Logged In: YES
user_id=31435

It sounds like a good idea, but (of course) it will need testing: the
special-case tokenizing of URLs is the single biggest win the project
ever got, so we can't afford any chance of screwing it up. I *expect*
this change will be a net win, though. We can make the regexp prettier
after we know whether it's a good idea <wink>.

----------------------------------------------------------------------
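For testing, turning the experimental switch on would look roughly
like this in a bayescustomize.ini-style options file (the section and
option names come from the comments above; the file name and exact
value syntax are assumptions):

    [Tokenizer]
    x-fancy_url_recognition: True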
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=830290&group_id=61702

_______________________________________________
Spambayes-bugs mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-bugs