Re: Language detection in TextCat
Matt Kettler wrote:
> Marc Perkel wrote:
> > I'm wondering if the language detection in TextCat can be improved. Here's the situation. It appears that TextCat was designed to be inclusive. You list the languages you want and it returns many possibilities so as not to trigger "unwanted" falsely. What I'm doing is extracting the language list for Exim, where I hope to offer a language reject list. The problem is that when you are rejecting languages you want a smaller list than when you are including languages to avoid false positives. I'd rather have a single (non-English) result. I'm wondering if there's a way to add some more options to alter the behavior of the plugin so it is more optimized towards the idea of rejecting languages?
>
> The language detection would have to be radically redesigned to have enough accuracy to support this. Currently TextCat is a *very* crude match, and will often return multiple languages for plain English text.
>
> TextCat is not designed to decide what language the email is, but to find a set of languages it *might* be. It is very prone to declaring extra languages that are not really present due to its design.
>
> This is useful in the "if it can't be my language, then it's garbage" sense, but not so useful in a "reject if it could be this language I don't like" sense. You'd really want "reject if it *IS* this language I don't like", but TextCat doesn't tell you what language an email is, only a set of what it might be.

Any chance someone might be interested in a radical redesign? I think language exclusion would be an extremely effective spam deterrent, as email in a language you don't speak is definitely spam.

Doesn't Linux come with spelling dictionaries for a lot of languages that are somehow hashed for fast spell-checking lookups? Except for very short messages, I would think that if you spell checked the message in several languages and found that 80% was spelled correctly in one of them, you have a match.

You wouldn't have to check every language: just start with some common ones and, if you don't match them, go on to less common ones. Would something like this be doable?
Re: Language detection in TextCat
On 7-Dec-2009, at 09:55, Marc Perkel wrote:
> Any chance someone might be interested in a radical redesign? I think language exclusion would be an extremely effective spam deterrent as email in a language you don't speak is definitely spam.

Erm… not necessarily. As a general rule, this might be good for adding weight, but I have gotten non-spam emails in a variety of languages I don't speak (Japanese, Italian, French, and Thai at least). Some are because, for example, I have a random signature that is in Japanese. The Thai was because I asked a question on a mailing list about Thai subtitles and the reader misunderstood and thought I was looking for Thai subtitles.

As a spam marker, non-English would be useful for me, but not perfect [1].

For my mailserver I know for certain there are virtual accounts (so a single UID) that communicate in French, Spanish, Arabic, Farsi, Japanese, and Korean at least. There are also a couple of African languages I don't know the names of.

[1] Just like the vast majority of rules; there is no single perfect rule with a 0% ham hit, not even BAYES_99, which might be why it's BAYES_99 and not BAYES_100.

--
Say, give it up, give it up, television's taking its toll
That's enough, that's enough, gimme the remote control
I've been nice, I've been good, please don't do this to me
Turn it off, turn it off, I don't want to have to see
Re: Language detection in TextCat
On Mon, 2009-12-07 at 08:55 -0800, Marc Perkel wrote:
> Except for very short messages I would think that if you spell checked the message in several languages and found that 80% was spelled correctly that you have a match.
>
> You wouldn't have to check every language, just start with some common ones and if you don't match them go to less common ones.

It might work better if you inverted the test: if the textual content appears to be badly misspelled in all the languages you accept, then it's spam.

This should be fairly easy to do: configure SA with the language(s) you will accept and the ratio of misspellings to total words that you'll accept as meaning 'unwanted language', after numbers and HTML tags have been excluded from the check. Apply the test to the whole body of a non-MIME message or to all MIME parts with type=text/*.

Martin
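The inverted test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an existing SpamAssassin feature: the `WORDLISTS` sets stand in for real spelling dictionaries, and the names `misspelling_ratio`, `is_unwanted_language`, and the `max_ratio` threshold are hypothetical.

```python
import re

# Hypothetical, tiny word lists; a real setup would load full spelling
# dictionaries for each accepted language (assumption for illustration).
WORDLISTS = {
    "en": {"the", "meeting", "is", "on", "monday", "please", "bring", "your", "notes"},
    "fr": {"la", "réunion", "est", "lundi", "merci", "de", "venir", "avec", "vos", "notes"},
}

TAG_RE = re.compile(r"<[^>]+>")      # crude HTML tag stripper
WORD_RE = re.compile(r"[^\W\d_]+")   # runs of letters only, so numbers are skipped


def misspelling_ratio(text, words):
    """Return misspelled words / total words, ignoring HTML tags and numbers."""
    tokens = [t.lower() for t in WORD_RE.findall(TAG_RE.sub(" ", text))]
    if not tokens:
        return 0.0  # nothing to judge, so do not count it against the message
    missing = sum(1 for t in tokens if t not in words)
    return missing / len(tokens)


def is_unwanted_language(text, accepted, max_ratio=0.5):
    """True only if the text looks badly misspelled in every accepted language."""
    return all(misspelling_ratio(text, WORDLISTS[lang]) > max_ratio
               for lang in accepted)
```

With these toy word lists, an English HTML fragment passes for an `["en", "fr"]` recipient while a German sentence is flagged. In practice the result would more plausibly feed a score than an outright reject.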
Re: Language detection in TextCat
Please, could you configure your MUA to quote, instead of colouring? HTML mail sucks.

On 07.12.09 08:55, Marc Perkel wrote:
> Any chance someone might be interested in a radical redesign? I think language exclusion would be an extremely effective spam deterrent as email in a language you don't speak is definitely spam.

How does this differ from the current status, where the user configures a few WANTED languages and all others hit UNWANTED_LANGUAGE?

> Doesn't Linux come with spelling dictionaries of words for a lot of languages that are somehow hashed for speed for spell checking lookups? Except for very short messages I would think that if you spell checked the message in several languages and found that 80% was spelled correctly that you have a match.
>
> You wouldn't have to check every language, just start with some common ones and if you don't match them go to less common ones. Would something like this be doable?

That would be doable, but very slow and quite error-prone. There are many words people often use that are not in dictionaries, and vice versa: a word misspelled in one language can be fine in another.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Chernobyl was an Windows 95 beta test site.
Re: Language detection in TextCat
Martin Gregorie wrote:
> On Mon, 2009-12-07 at 08:55 -0800, Marc Perkel wrote:
> > Except for very short messages I would think that if you spell checked the message in several languages and found that 80% was spelled correctly that you have a match.
> >
> > You wouldn't have to check every language, just start with some common ones and if you don't match them go to less common ones.
>
> It might work better if you inverted the test: if the textual content appears to be badly misspelled in all the languages you accept, then it's spam.
>
> This should be fairly easy to do: configure SA with the language(s) you will accept and the ratio of misspellings to total words that you'll accept as meaning 'unwanted language', after numbers and HTML tags have been excluded from the check. Apply the test to the whole body of a non-MIME message or to all MIME parts with type="text/*".
>
> Martin

OK - maybe this is a long shot, but suppose you did this:

    aspell -a --lang=en < text.txt | grep -c '^[&#]'
    aspell -a --lang=fr < text.txt | grep -c '^[&#]'
    ...

What this would return is the number of misspelled words in each language (in aspell's pipe mode, lines starting with & or # mark misspellings). The language with the fewest misspellings is the correct language. Not sure how fast it would run or what you would want to do to the text first, but is this an idea worth pursuing?
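The "fewest misspellings wins" idea, combined with the 80% threshold suggested earlier in the thread, might look roughly like this in Python. It is a sketch under assumptions: the `DICTIONARIES` sets are stand-ins for real aspell word lists, and `guess_language` and `min_correct` are hypothetical names, not part of any existing tool.

```python
import re

# Stand-in dictionaries (assumption); in practice these would be built
# from the spell checker's word lists for each language.
DICTIONARIES = {
    "en": {"please", "send", "me", "the", "report", "today"},
    "fr": {"merci", "de", "me", "envoyer", "le", "rapport"},
}

WORD_RE = re.compile(r"[^\W\d_]+")  # letters only; digits are ignored


def guess_language(text, min_correct=0.8):
    """Pick the language with the fewest misspelled words.

    Returns None when the text has no words, or when even the best
    language has fewer than `min_correct` of its words spelled correctly.
    """
    tokens = [t.lower() for t in WORD_RE.findall(text)]
    if not tokens:
        return None
    misspelled = {
        lang: sum(1 for t in tokens if t not in words)
        for lang, words in DICTIONARIES.items()
    }
    best = min(misspelled, key=misspelled.get)
    if 1 - misspelled[best] / len(tokens) < min_correct:
        return None  # no language fits well enough to call it a match
    return best
```

Returning `None` rather than the least-bad language matters for rejection: a short or jargon-heavy message would otherwise be assigned to whichever dictionary happens to share a word or two with it.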
RE: Language detection in TextCat
> This should be fairly easy to do: configure SA with the language(s) you will accept and the ratio of misspellings to total words that you'll accept as meaning 'unwanted language' after numbers and HTML tags have been excluded from the check. Apply the test to the whole body of a non-MIME message or to all MIME parts with type=text/*.
>
> Martin

The theory is sound in general... yet in real-world practice it would be just another small score to add towards the spamminess, right?

There is just too much bad languange in text communications out there... (pun intended) ;-)

- rh
Language detection in TextCat
I'm wondering if the language detection in TextCat can be improved. Here's the situation. It appears that TextCat was designed to be inclusive. You list the languages you want and it returns many possibilities so as not to trigger "unwanted" falsely. What I'm doing is extracting the language list for Exim, where I hope to offer a language reject list. The problem is that when you are rejecting languages you want a smaller list than when you are including languages to avoid false positives. I'd rather have a single (non-English) result. I'm wondering if there's a way to add some more options to alter the behavior of the plugin so it is more optimized towards the idea of rejecting languages?
Re: Language detection in TextCat
Marc Perkel wrote:
> I'm wondering if the language detection in TextCat can be improved. Here's the situation. It appears that TextCat was designed to be inclusive. You list the languages you want and it returns many possibilities so as not to trigger "unwanted" falsely. What I'm doing is extracting the language list for Exim, where I hope to offer a language reject list. The problem is that when you are rejecting languages you want a smaller list than when you are including languages to avoid false positives. I'd rather have a single (non-English) result. I'm wondering if there's a way to add some more options to alter the behavior of the plugin so it is more optimized towards the idea of rejecting languages?

The language detection would have to be radically redesigned to have enough accuracy to support this. Currently TextCat is a *very* crude match, and will often return multiple languages for plain English text.

TextCat is not designed to decide what language the email is, but to find a set of languages it *might* be. It is very prone to declaring extra languages that are not really present due to its design.

This is useful in the "if it can't be my language, then it's garbage" sense, but not so useful in a "reject if it could be this language I don't like" sense. You'd really want "reject if it *IS* this language I don't like", but TextCat doesn't tell you what language an email is, only a set of what it might be.
Re: Language detection in TextCat
On Sun, Dec 06, 2009 at 11:49:25PM -0500, Matt Kettler wrote:
> Marc Perkel wrote:
> > I'm wondering if the language detection in TextCat can be improved. Here's the situation. It appears that TextCat was designed to be inclusive. You list the languages you want and it returns many possibilities so as not to trigger "unwanted" falsely. What I'm doing is extracting the language list for Exim, where I hope to offer a language reject list. The problem is that when you are rejecting languages you want a smaller list than when you are including languages to avoid false positives. I'd rather have a single (non-English) result. I'm wondering if there's a way to add some more options to alter the behavior of the plugin so it is more optimized towards the idea of rejecting languages?
>
> The language detection would have to be radically redesigned to have enough accuracy to support this. Currently TextCat is a *very* crude match, and will often return multiple languages for plain English text.
>
> TextCat is not designed to decide what language the email is, but to find a set of languages it *might* be. It is very prone to declaring extra languages that are not really present due to its design.
>
> This is useful in the "if it can't be my language, then it's garbage" sense, but not so useful in a "reject if it could be this language I don't like" sense. You'd really want "reject if it *IS* this language I don't like", but TextCat doesn't tell you what language an email is, only a set of what it might be.

Also beware of the case bug:

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229

I've got OK results with my corpus with textcat_acceptable_score ~1.02 and textcat_max_languages ~1-2.

Of course I wouldn't plainly reject anything.
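To see why tightening those two settings narrows the result, here is a rough Python sketch of how the candidate list appears to be chosen. It is based on my reading of the plugin's option descriptions, not on its actual code: languages are ranked by a distance score (lower is better), every language within `acceptable_score` times the best score is kept, and the result is treated as unknown if more than `max_languages` survive. The function name and the example scores are hypothetical.

```python
def select_languages(scores, acceptable_score=1.02, max_languages=2):
    """Return candidate languages, best first, or [] for 'unknown'.

    `scores` maps language code -> distance score (lower means closer).
    This mirrors the described behaviour only; it is not the plugin's code.
    """
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1])
    best_score = ranked[0][1]
    # Keep every language whose score is close enough to the best one.
    candidates = [lang for lang, score in ranked
                  if score <= best_score * acceptable_score]
    # Too many near-ties means the classifier cannot really tell.
    if len(candidates) > max_languages:
        return []
    return candidates


# Hypothetical scores for a short English message.
example = {"en": 1000, "sco": 1015, "fr": 1200}
print(select_languages(example, acceptable_score=1.02, max_languages=2))
print(select_languages(example, acceptable_score=1.02, max_languages=1))
```

A factor close to 1.0 keeps only near-ties with the best match, and a low language cap turns ambiguous messages into "unknown" instead of a long list, which is closer to what a reject list needs.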
Re: Language detection in TextCat
On 06.12.09 11:39, Marc Perkel wrote:
> I'm wondering if the language detection in TextCat can be improved. Here's the situation. It appears that TextCat was designed to be inclusive. You list the languages you want and it returns many possibilities so as not to trigger "unwanted" falsely. What I'm doing is extracting the language list for Exim, where I hope to offer a language reject list. The problem is that when you are rejecting languages you want a smaller list than when you are including languages to avoid false positives. I'd rather have a single (non-English) result. I'm wondering if there's a way to add some more options to alter the behavior of the plugin so it is more optimized towards the idea of rejecting languages?

What's the point? Why do you think that would be an improvement?

I think that most people speak a few languages, while there are dozens of languages they do not speak. Why do you think people would want all but a few languages?

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
REALITY.SYS corrupted. Press any key to reboot Universe.