Re: Language detection in TextCat

Marc Perkel Mon, 07 Dec 2009 11:02:47 -0800

Martin Gregorie wrote:

On Mon, 2009-12-07 at 08:55 -0800, Marc Perkel wrote:

Except for very short messages, I would think that if you spell-checked
the message in several languages and found that 80% of it was spelled
correctly in one of them, you have a match. You wouldn't have to check
every language: just start with some common ones and, if none of them
match, go on to less common ones.

It might work better if you inverted the test: if the textual content
appears to be badly misspelled in all the languages you accept then it's
spam.

This should be fairly easy to do: configure SA with the language(s) you
will accept and the ratio of misspellings to total words that you'll
accept as meaning 'unwanted language' after numbers and HTML tags have
been excluded from the check. Apply the test to the whole body of a
non-MIME message or to all MIME parts with type="text/*".


Martin
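As a rough illustration of Martin's inverted test, here is a minimal shell sketch. The function names, the default 0.5 threshold, and the use of "aspell list" are assumptions made for illustration; nothing like this exists in SA as far as this thread says. It strips HTML tags and digit runs, then compares the ratio of misspelled words to total words against the threshold. Tags that span several lines are not handled.

```shell
#!/bin/sh
# Hypothetical helpers: names and the 0.5 default are assumptions.

# Replace single-line HTML tags and runs of digits with spaces.
strip_noise() {
    sed -e 's/<[^>]*>/ /g' -e 's/[0-9][0-9]*/ /g'
}

# Exit 0 when MISSPELLED/TOTAL is strictly above THRESHOLD.
# Usage: ratio_exceeds MISSPELLED TOTAL THRESHOLD
ratio_exceeds() {
    awk -v m="$1" -v t="$2" -v th="$3" \
        'BEGIN { exit !(t > 0 && m / t > th) }'
}

# Exit 0 when the text on stdin looks badly misspelled in language $1.
# Requires aspell and the matching dictionary to be installed.
# Usage: looks_foreign LANG [THRESHOLD] < body.txt
looks_foreign() {
    text=$(strip_noise)
    total=$(printf '%s\n' "$text" | wc -w)
    bad=$(printf '%s\n' "$text" | aspell --lang="$1" list | wc -l)
    ratio_exceeds "$bad" "$total" "${2:-0.5}"
}
```

A message would then count as 'unwanted language' only if looks_foreign succeeds for every accepted language, for example: looks_foreign en < body.txt && looks_foreign fr < body.txt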

OK - maybe this is a long shot, but suppose you did this:

aspell -a --lang=en < text.txt | grep -c '^[&#]'
aspell -a --lang=fr < text.txt | grep -c '^[&#]'
...

What this would return is the number of misspelled words in each language (in "aspell -a" output, a line starting with "&" or "#" marks a misspelled word). The language with the fewest misspellings is the correct language. Not sure how fast it would run or what you would want to do to the text first, but is this an idea worth pursuing?
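The comparison step could be wrapped up along these lines. This is only a sketch: count_misspelled and best_language are made-up names, and it assumes aspell plus a dictionary for each language is installed. Each input line is prefixed with "^" so that aspell's pipe mode treats it as text to check and not as a command.

```shell
#!/bin/sh
# Hypothetical helpers: the names are illustrative only.

# Count misspelled words in "aspell -a" output read from stdin.
# "&" lines are misspellings with suggestions, "#" lines have none.
count_misspelled() {
    grep -c '^[&#]' || true   # grep exits 1 on a zero count
}

# Print the language with the fewest misspellings for FILE.
# Usage: best_language FILE LANG...
best_language() {
    file=$1
    shift
    for lang in "$@"; do
        n=$(sed 's/^/^/' "$file" | aspell -a --lang="$lang" | count_misspelled)
        printf '%s %s\n' "$n" "$lang"
    done | sort -n | head -n 1 | cut -d' ' -f2
}
```

For example, "best_language text.txt en fr de" would print one of en, fr or de. Very short messages will give unreliable counts, as noted earlier in the thread.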
