Feature Requests item #1000427, was opened at 2004-07-29 17:07
Message generated for change (Comment added) made by seier
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=1000427&group_id=61702

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Michael Engel (mkengel)
Assigned to: Nobody/Anonymous (nobody)
Summary: non-English spam; localized filters

Initial Comment:
How to deal with spam in a mixture of English/non-English mails* - it seems that they easily pass the filters

* in my case English/German/French and Japanese

Solution idea: localized filters, one after the other; should be possible to choose upon installation

----------------------------------------------------------------------

Comment By: Christian Blackburn (seier)
Date: 2006-05-20 04:45

Message:
Logged In: YES
user_id=561770

Hi Gang,

I think it's very important to be able to detect spam coming from a particular language. However, I think during installation the user should be asked what language(s) they speak, and any message that is not in one of their chosen languages and did not originate from a friend (someone in their address book) should be deleted. If it is from a known user, it would be awesome if that person were written back to, reminding them that you only speak Swahili (obviously, just for example) and that all messages must be sent in that language.

Thanks,
Christian Blackburn

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2004-10-05 07:13

Message:
Logged In: YES
user_id=529503

I'm working on a non-English / multi-lingual tokenizer. See patch #824651.

* This isn't compatible with original spambayes.

----------------------------------------------------------------------

Comment By: Michael Engel (mkengel)
Date: 2004-08-09 00:04

Message:
Logged In: YES
user_id=780774

Thank you for the comments. I have waited a little bit to see if the training on German spam had an effect.
It did: after a total of 4 weeks, SpamBayes now detects these messages as spam (0.44 - my cutoff line is 0.35). Probably there were not enough messages in German and French for SpamBayes to see the difference.

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2004-08-02 23:42

Message:
Logged In: YES
user_id=552329

Is your ham also mixed language? With English/German/French, SpamBayes doesn't care about the language and will just learn each word as good/bad, so it should work fine (with appropriate training). Have you trained on these sorts of spam? Attaching the clues for a misclassified message would give more insight into this.

The Japanese is more difficult, because SpamBayes creates tokens by (mostly) splitting on whitespace, and this isn't how Asian languages work (we would get sentence tokens, I think). It's unlikely that we will ever handle this well, and the best solution would be to have someone (willing to do all the work) create a forked project that has a different tokeniser, customised for Asian languages.

----------------------------------------------------------------------

_______________________________________________
Spambayes-bugs mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-bugs
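
To illustrate Tony Meyer's point about whitespace splitting, here is a minimal sketch of why whitespace-based tokens work for English, German or French but collapse a Japanese sentence into a single "sentence token". It is not SpamBayes code and is not taken from patch #824651: the function names `whitespace_tokens` and `char_bigrams` are hypothetical, the sample strings are made up, and character bigrams are only one commonly used way to tokenise unsegmented text, shown here as an assumption about what a customised tokeniser might do.

```python
def whitespace_tokens(text):
    """Split on runs of whitespace (a simplification of the behaviour
    described in the thread, not the real SpamBayes tokeniser)."""
    return text.split()


def char_bigrams(text):
    """Hypothetical alternative for unsegmented scripts: emit overlapping
    two-character tokens for each whitespace-free run of text."""
    tokens = []
    for chunk in text.split():
        if len(chunk) < 2:
            tokens.append(chunk)
        else:
            tokens.extend(chunk[i:i + 2] for i in range(len(chunk) - 1))
    return tokens


# Made-up sample messages, for illustration only.
english = "cheap pills online now"
japanese = "今すぐ無料で登録してください"

print(whitespace_tokens(english))   # one token per word
print(whitespace_tokens(japanese))  # the whole sentence as one token
print(char_bigrams(japanese))       # many short, reusable tokens
```

With whitespace splitting, each distinct Japanese sentence becomes its own token, so a classifier rarely sees the same token twice and has little to learn from. Shorter overlapping tokens recur across messages, which is what word-level training relies on for the European languages.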
