Matt Kettler wrote:
Marc Perkel wrote:
  
I'm wondering if the language detection in TextCat can be improved.
Here's the situation.

It appears that TextCat was designed to be inclusive. You list the
languages you want and it returns many possibilities so as not to
falsely flag wanted mail.

What I'm doing is extracting the language list for Exim where I hope
to offer a language reject list. The problem is that when you are
rejecting languages you want a smaller list than when you are
including languages, to avoid false positives. I'd rather have a single
(non-English) result.

I'm wondering if there's a way to add some more options to alter the
behavior of the plugin so it is more optimized towards the idea of
rejecting languages?


    
The language detection would have to be radically redesigned to have
enough accuracy to support this.

Currently TextCat is a *very* crude match, and will often return
multiple languages for plain English text.

TextCat is not designed to decide what language the email is, but to
find a set of languages it *might* be. It is very prone to declaring
extra languages that are not really present due to its design.

This is useful in the "if it can't be my language, then it's garbage"
sense, but not so useful in a "reject if it could be this language I
don't like" sense.  You'd really want "reject if it *IS* this language I
don't like", but TextCat doesn't tell you what language an email is, only
a set of what it might be.
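
To make the asymmetry concrete, here is a minimal sketch in Python. It is
not TextCat's code or API; the function names and the candidate set are
made up for illustration, assuming only that the detector hands back a
list of possible languages.

```python
def passes_allow_list(candidates, ok_languages):
    """Inclusive test: keep the mail if ANY candidate language is acceptable."""
    return bool(set(candidates) & set(ok_languages))


def hits_reject_list(candidates, banned_languages):
    """Exclusive test: flag the mail if ANY candidate language is banned."""
    return bool(set(candidates) & set(banned_languages))


# Hypothetical candidate set returned for a plain English message.
# The extra entries stand in for the spurious guesses described above.
candidates = ["en", "sco", "fr"]

print(passes_allow_list(candidates, ["en"]))  # True: the extra guesses do no harm
print(hits_reject_list(candidates, ["fr"]))   # True: English mail wrongly rejected
```

Extra guesses only widen the allow-list test, but every extra guess is a
potential false positive for the reject-list test.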

  

Any chance someone might be interested in a radical redesign? I think language exclusion would be an extremely effective spam deterrent as email in a language you don't speak is definitely spam.

Doesn't Linux come with spelling dictionaries of words for a lot of languages that are somehow hashed for speed for spell checking lookups?

Except for very short messages, I would think that if you spell-checked the message in several languages and found that 80% of it was spelled correctly in one of them, you would have a match. You wouldn't have to check every language; just start with some common ones and, if they don't match, go on to less common ones.

Would something like this be doable?
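
A rough sketch of the idea in Python, under stated assumptions: the word
lists are tiny in-memory sets standing in for real spelling dictionaries,
and the function name, the 20-word minimum, and the ordering of languages
are all hypothetical choices rather than anything taken from SpamAssassin
or TextCat.

```python
import re

# Runs of letters only (Unicode-aware); digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")


def detect_language(text, dictionaries, threshold=0.8, min_words=20):
    """Return the first language whose word list covers at least
    `threshold` of the words in `text`, or None if nothing matches.

    `dictionaries` is an ordered list of (language, set_of_words) pairs,
    most common languages first, so the search can stop early.
    """
    words = [w.lower() for w in WORD_RE.findall(text)]
    if len(words) < min_words:
        return None  # too short to judge reliably
    for lang, vocab in dictionaries:
        known = sum(1 for w in words if w in vocab)
        if known / len(words) >= threshold:
            return lang
    return None


# Toy word lists; real ones would be loaded from spelling dictionaries.
dictionaries = [
    ("en", {"the", "cat", "sat", "on", "mat"}),
    ("de", {"die", "katze", "sass", "auf", "der", "matte"}),
]

print(detect_language("The cat sat on the mat", dictionaries, min_words=3))        # en
print(detect_language("Die Katze sass auf der Matte", dictionaries, min_words=3))  # de
```

Set lookups keep each word test cheap, and returning None for short or
unmatched messages gives a single answer or no answer at all, which is
what a reject list needs.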
