Re: Language detection in TextCat

2009-12-07 Thread Marc Perkel






Matt Kettler wrote:
> Marc Perkel wrote:
>> I'm wondering if the language detection in TextCat can be improved.
>> Here's the situation.
>>
>> It appears that TextCat was designed to be inclusive. You list the
>> languages you want and it returns many possibilities so as not to
>> trigger unwanted falsely.
>>
>> What I'm doing is extracting the language list for Exim where I hope
>> to offer a language reject list. The problem is that when you are
>> rejecting languages you want a smaller list than when you are
>> including languages to avoid false positives. I'd rather have a single
>> (non-English) result.
>>
>> I'm wondering if there's a way to add some more options to alter the
>> behavior of the plugin so it is more optimized towards the idea of
>> rejecting languages?
>
> The language detection would have to be radically redesigned to have
> enough accuracy to support this.
>
> Currently TextCat is a *very* crude match, and will often return
> multiple languages for plain English text.
>
> Textcat is not designed to decide what language the email is, but to
> find a set of languages it *might* be. It is very prone to declaring
> extra languages that are not really present due to its design.
>
> This is useful in the "if it can't be my language, then it's garbage"
> sense, but not so useful in a "reject if it could be this language I
> don't like".  You'd really want "reject if it *IS* this language I don't
> like", but textcat doesn't tell you what language an email is, only a
> set of what it might be.

  


Any chance someone might be interested in a radical redesign? I think
language exclusion would be an extremely effective spam deterrent as
email in a language you don't speak is definitely spam. 

Doesn't Linux come with spelling dictionaries of words for a lot of
languages that are somehow hashed for speed for spell checking lookups?

Except for very short messages I would think that if you spell checked
the message in several languages and found that 80% was spelled
correctly that you have a match. You wouldn't have to check every
language, just start with some common ones and if you don't match them
go to less common ones. 

Would something like this be doable?
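
A minimal sketch of the cascade proposed above, assuming one plain-text word list per language (one word per line, in the style of the lists many Linux systems keep under /usr/share/dict). The function names, the "lang:path" argument format, and the fixed 80% threshold are illustrative assumptions, not part of TextCat or SpamAssassin. The tokenizer keeps only ASCII letters, so accented words would need a locale-aware replacement.

```sh
#!/bin/sh
# Sketch only: function names and word-list paths are hypothetical.

# Print the words of FILE, one per line, lowercased.
# Rough tokenizer: keeps only ASCII letters and apostrophes.
words() {
    tr -cs "A-Za-z'" '\n' < "$1" | tr 'A-Z' 'a-z' | sed '/^$/d'
}

# Print the percentage (0-100) of words in FILE found in WORDLIST.
percent_known() {
    file=$1 wordlist=$2
    total=$(words "$file" | wc -l | tr -d ' ')
    [ "$total" -gt 0 ] || { echo 0; return; }
    # -F fixed strings, -x whole-line match, -i ignore case
    known=$(words "$file" | grep -i -x -F -f "$wordlist" | wc -l | tr -d ' ')
    echo $(( known * 100 / total ))
}

# Try each "lang:wordlist" pair in order (most common language first).
# Print the first language with at least 80% known words, else "unknown".
detect_language() {
    file=$1; shift
    for pair in "$@"; do
        lang=${pair%%:*} wordlist=${pair#*:}
        if [ "$(percent_known "$file" "$wordlist")" -ge 80 ]; then
            echo "$lang"
            return 0
        fi
    done
    echo unknown
    return 1
}
```

A call might look like `detect_language msg.txt en:/usr/share/dict/words fr:/usr/share/dict/french`, where both paths are examples that vary by distribution. Stopping at the first language over the threshold keeps the common case cheap, at the cost of never noticing a later language that would have matched better.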





Re: Language detection in TextCat

2009-12-07 Thread LuKreme
On 7-Dec-2009, at 09:55, Marc Perkel wrote:
> Any chance someone might be interested in a radical redesign? I think
> language exclusion would be an extremely effective spam deterrent as email in
> a language you don't speak is definitely spam.


Erm… not necessarily. As a general rule, this might be good for adding weight 
but I have gotten non-spam emails in a variety of languages I don't speak 
(Japanese, Italian, French, and Thai at least).

Some are because, for example, I have a random signature that is in Japanese. 
The Thai was because I asked a question on a mailing list about Thai subtitles 
and the reader misunderstood and thought I was looking for Thai subtitles.

As a spam marker, non-English would be useful for me, but not perfect[1]. For 
my mailserver I know for certain there are virtual accounts (so single UID) 
that communicate in French, Spanish, Arabic, Farsi, Japanese, and Korean at 
least. There's also a couple of African languages I don't know the names of.

[1] Just like the vast majority of rules; there is no one single perfect rule 
with a 0% ham hit, not even BAYES_99, which might be why it's BAYES_99 and not 
BAYES_100.

-- 
Say, give it up, give it up, television's taking its toll
That's enough, that's enough, gimme the remote control
I've been nice, I've been good, please don't do this to me
Turn it off, turn it off, I don't want to have to see



Re: Language detection in TextCat

2009-12-07 Thread Martin Gregorie
On Mon, 2009-12-07 at 08:55 -0800, Marc Perkel wrote:

> Except for very short messages I would think that if you spell checked
> the message in several languages and found that 80% was spelled
> correctly that you have a match. You wouldn't have to check every
> language, just start with some common ones and if you don't match them
> go to less common ones.
 
It might work better if you inverted the test: if the textual content
appears to be badly misspelled in all the languages you accept then it's
spam.

This should be fairly easy to do: configure SA with the language(s) you
will accept and the ratio of misspellings to total words that you'll
accept as meaning 'unwanted language' after numbers and HTML tags have
been excluded from the check. Apply the test to the whole body of a
non-MIME message or to all MIME parts with type=text/*.


Martin
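
A minimal sketch of the inverted test described above, assuming aspell is installed with a dictionary for each accepted language. The function names and the threshold argument are hypothetical. Tag stripping is done per line with sed, so tags that span several lines are not handled, and the word count is whitespace-based, so the ratio is only approximate.

```sh
#!/bin/sh
# Sketch only: function names and the threshold argument are hypothetical.
# Requires aspell with a dictionary for each accepted language.

# Strip HTML tags, drop tokens containing digits, print one word per line.
clean_words() {
    sed -e 's/<[^>]*>/ /g' "$1" | tr -s '[:space:]' '\n' \
        | grep -v '[0-9]' | sed '/^$/d'
}

# Print the percentage (0-100) of words in FILE that aspell rejects for LANG.
misspelled_percent() {
    file=$1 lang=$2
    total=$(clean_words "$file" | wc -l | tr -d ' ')
    [ "$total" -gt 0 ] || { echo 0; return; }
    # "aspell list" prints each misspelled word from stdin on its own line
    bad=$(clean_words "$file" | aspell --lang="$lang" list | wc -l | tr -d ' ')
    echo $(( bad * 100 / total ))
}

# Succeed (exit 0) if FILE exceeds MAX_PERCENT misspellings in every
# accepted language, i.e. it does not read as any language we accept.
is_unwanted_language() {
    file=$1 max_percent=$2; shift 2
    for lang in "$@"; do
        if [ "$(misspelled_percent "$file" "$lang")" -le "$max_percent" ]; then
            return 1    # reads well enough in an accepted language
        fi
    done
    return 0
}
```

For example, `is_unwanted_language body.txt 40 en fr` would succeed only if more than 40% of the words are rejected in both English and French. The 40 is an arbitrary illustration; a usable threshold would have to be tuned against real ham and spam.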



Re: Language detection in TextCat

2009-12-07 Thread Matus UHLAR - fantomas
Please, could you configure your MUA to quote, instead of colouring?
HTML mail sucks.

On 07.12.09 08:55, Marc Perkel wrote:
> Any chance someone might be interested in a radical redesign? I think
> language exclusion would be an extremely effective spam deterrent as email
> in a language you don't speak is definitely spam.

How does this differ from the current status - the user configures a few WANTED
languages, and all others are hitting UNWANTED_LANGUAGE?

> Doesn't Linux come with spelling dictionaries of words for a lot of
> languages that are somehow hashed for speed for spell checking lookups?
>
> Except for very short messages I would think that if you spell checked the
> message in several languages and found that 80% was spelled correctly that
> you have a match. You wouldn't have to check every language, just start
> with some common ones and if you don't match them go to less common ones.
>
> Would something like this be doable?

That would be doable, but very slow and quite error-prone.
There are many words people often use that are not in dictionaries and
vice versa - a word misspelled in one language can be OK in another one.
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Chernobyl was an Windows 95 beta test site.


Re: Language detection in TextCat

2009-12-07 Thread Marc Perkel






Martin Gregorie wrote:
> On Mon, 2009-12-07 at 08:55 -0800, Marc Perkel wrote:
>> Except for very short messages I would think that if you spell checked
>> the message in several languages and found that 80% was spelled
>> correctly that you have a match. You wouldn't have to check every
>> language, just start with some common ones and if you don't match them
>> go to less common ones.
>
> It might work better if you inverted the test: if the textual content
> appears to be badly misspelled in all the languages you accept then it's
> spam.
>
> This should be fairly easy to do: configure SA with the language(s) you
> will accept and the ratio of misspellings to total words that you'll
> accept as meaning 'unwanted language' after numbers and HTML tags have
> been excluded from the check. Apply the test to the whole body of a
> non-MIME message or to all MIME parts with type="text/*".
>
> Martin

  


OK - maybe this is a long shot but suppose you did this:


# In "aspell -a" output, lines starting with & or # mark misspelled words.
aspell -a --lang=en < text.txt | grep -c '^[&#]'
aspell -a --lang=fr < text.txt | grep -c '^[&#]'
...

What this would return is the number of misspelled words in each
language. The language with the fewest misspellings is the correct
language. Not sure how fast it would run or what you would want to do
to the text first but is this an idea worth pursuing?
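
A minimal sketch of that comparison as a loop, assuming aspell is installed with a dictionary for every candidate language. The helper name `best_language` is hypothetical. It uses `aspell list`, which prints each misspelled word on its own line, so the count is simply the number of output lines.

```sh
#!/bin/sh
# Sketch only: best_language is a hypothetical helper, not an existing tool.
# Requires aspell with a dictionary for each candidate language.

# Print the candidate language for which aspell reports the fewest
# misspelled words in FILE. Ties go to the language listed first.
best_language() {
    file=$1; shift
    best= best_count=
    for lang in "$@"; do
        count=$(aspell --lang="$lang" list < "$file" | wc -l | tr -d ' ')
        if [ -z "$best" ] || [ "$count" -lt "$best_count" ]; then
            best=$lang best_count=$count
        fi
    done
    echo "$best"
}
```

It could be called as `best_language text.txt en fr de`. Note that this always names some language, even for gibberish, so in practice it would need to be combined with a cap on the misspelling count before the answer is trusted. It also runs one aspell process per language, so the cost grows with the size of the candidate list.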






RE: Language detection in TextCat

2009-12-07 Thread R-Elists
 

 
> This should be fairly easy to do: configure SA with the
> language(s) you will accept and the ratio of misspellings to
> total words that you'll accept as meaning 'unwanted language'
> after numbers and HTML tags have been excluded from the
> check. Apply the test to the whole body of a non-MIME message
> or to all MIME parts with type=text/*.
>
> Martin

The theory is sound in general...

yet in real-world practice it would be just another small score to add towards
the spamminess, right?

there is just too much bad language in text communications out there...
(pun intended)  ;-)

 - rh




Language detection in TextCat

2009-12-06 Thread Marc Perkel
I'm wondering if the language detection in TextCat can be improved. 
Here's the situation.


It appears that TextCat was designed to be inclusive. You list the 
languages you want and it returns many possibilities so as not to 
trigger unwanted falsely.


What I'm doing is extracting the language list for Exim where I hope to 
offer a language reject list. The problem is that when you are rejecting 
languages you want a smaller list than when you are including languages 
to avoid false positives. I'd rather have a single (non-English) result.


I'm wondering if there's a way to add some more options to alter the 
behavior of the plugin so it is more optimized towards the idea of 
rejecting languages?




Re: Language detection in TextCat

2009-12-06 Thread Matt Kettler
Marc Perkel wrote:
> I'm wondering if the language detection in TextCat can be improved.
> Here's the situation.
>
> It appears that TextCat was designed to be inclusive. You list the
> languages you want and it returns many possibilities so as not to
> trigger unwanted falsely.
>
> What I'm doing is extracting the language list for Exim where I hope
> to offer a language reject list. The problem is that when you are
> rejecting languages you want a smaller list than when you are
> including languages to avoid false positives. I'd rather have a single
> (non-English) result.
>
> I'm wondering if there's a way to add some more options to alter the
> behavior of the plugin so it is more optimized towards the idea of
> rejecting languages?

The language detection would have to be radically redesigned to have
enough accuracy to support this.

Currently TextCat is a *very* crude match, and will often return
multiple languages for plain English text.

Textcat is not designed to decide what language the email is, but to
find a set of languages it *might* be. It is very prone to declaring
extra languages that are not really present due to its design.

This is useful in the "if it can't be my language, then it's garbage"
sense, but not so useful in a "reject if it could be this language I
don't like".  You'd really want "reject if it *IS* this language I don't
like", but textcat doesn't tell you what language an email is, only a
set of what it might be.



Re: Language detection in TextCat

2009-12-06 Thread Henrik K
On Sun, Dec 06, 2009 at 11:49:25PM -0500, Matt Kettler wrote:
> Marc Perkel wrote:
>> I'm wondering if the language detection in TextCat can be improved.
>> Here's the situation.
>>
>> It appears that TextCat was designed to be inclusive. You list the
>> languages you want and it returns many possibilities so as not to
>> trigger unwanted falsely.
>>
>> What I'm doing is extracting the language list for Exim where I hope
>> to offer a language reject list. The problem is that when you are
>> rejecting languages you want a smaller list than when you are
>> including languages to avoid false positives. I'd rather have a single
>> (non-English) result.
>>
>> I'm wondering if there's a way to add some more options to alter the
>> behavior of the plugin so it is more optimized towards the idea of
>> rejecting languages?
>
> The language detection would have to be radically redesigned to have
> enough accuracy to support this.
>
> Currently TextCat is a *very* crude match, and will often return
> multiple languages for plain English text.
>
> Textcat is not designed to decide what language the email is, but to
> find a set of languages it *might* be. It is very prone to declaring
> extra languages that are not really present due to its design.
>
> This is useful in the "if it can't be my language, then it's garbage"
> sense, but not so useful in a "reject if it could be this language I
> don't like".  You'd really want "reject if it *IS* this language I don't
> like", but textcat doesn't tell you what language an email is, only a
> set of what it might be.

Also beware of the case bug:

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229

I've got ok results with my corpus with textcat_acceptable_score ~1.02 and
textcat_max_languages ~1-2. Of course I wouldn't plain reject anything..
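
For reference, the settings mentioned above would sit in a SpamAssassin configuration file such as local.cf, roughly as sketched below. The two textcat_ values are the ones quoted (taking 2 from the suggested range). The ok_languages list and the score are placeholder assumptions to adapt, with the score kept low in line with the advice not to reject outright. The TextCat plugin also has to be enabled with a loadplugin line, commonly found in a .pre file.

```
# Sketch only: ok_languages and the score are example values.
ok_languages              en fr
textcat_max_languages     2
textcat_acceptable_score  1.02

# Rule fired by TextCat when no detected language is in ok_languages.
score UNWANTED_LANGUAGE_BODY  1.0
```

Lowering textcat_acceptable_score and textcat_max_languages shortens the list of languages TextCat returns, which is the direction a reject-list use case needs, though it does not change the underlying n-gram matching.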



Re: Language detection in TextCat

2009-12-06 Thread Matus UHLAR - fantomas
On 06.12.09 11:39, Marc Perkel wrote:
> I'm wondering if the language detection in TextCat can be improved.
> Here's the situation.
>
> It appears that TextCat was designed to be inclusive. You list the
> languages you want and it returns many possibilities so as not to
> trigger unwanted falsely.
>
> What I'm doing is extracting the language list for Exim where I hope to
> offer a language reject list. The problem is that when you are rejecting
> languages you want a smaller list than when you are including languages
> to avoid false positives. I'd rather have a single (non-English) result.
>
> I'm wondering if there's a way to add some more options to alter the
> behavior of the plugin so it is more optimized towards the idea of
> rejecting languages?

What's the point? Why do you think that would be an improvement?
I think that most people speak a few languages while there are dozens of
languages they do not speak. Why do you think people would want all but a few
languages?
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
REALITY.SYS corrupted. Press any key to reboot Universe.