Marc Perkel wrote:
> I'm wondering if the language detection in TextCat can be improved.
> Here's the situation.
>
> It appears that TextCat was designed to be inclusive. You list the
> languages you want and it returns many possibilities so as not to
> trigger unwanted falsely.
>
> What I'm doing is extracting the language list for Exim where I hope
> to offer a language reject list. The problem is that when you are
> rejecting languages you want a smaller list than when you are
> including languages to avoid false positives. I'd rather have a single
> (non-english) result.
>
> I'm wondering if there's a way to add some more options to alter the
> behavior of the plugin so it is more optimized towards the idea of
> rejecting languages?
>

The language detection would have to be radically redesigned to have enough accuracy to support this.
Currently TextCat is a *very* crude match, and will often return multiple languages for plain English text. TextCat is not designed to decide what language an email is in, but to find a set of languages it *might* be in. By design, it is very prone to declaring extra languages that are not really present. This is useful in the "if it can't be my language, then it's garbage" sense, but not so useful in a "reject if it could be this language I don't like" sense. You'd really want "reject if it *IS* this language I don't like", but TextCat doesn't tell you what language an email is in, only a set of languages it might be in.
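
To make the difference concrete, here is a rough Python sketch of the three policies. It is not the plugin's code or API. The function names and the candidate list are made up for illustration, and the only thing assumed about the detector is that it hands back a set of language tags the message *might* be in.

```python
# Hypothetical sketch: policies applied to a detector's candidate set.
# The candidate list is invented; it stands in for a crude matcher
# returning several "might be" languages for plain English text.

def could_be_any(candidates, ok_languages):
    """Allow-list use: keep the mail if it might be a language I read."""
    return bool(set(candidates) & set(ok_languages))

def naive_reject(candidates, banned):
    """Reject-list use: reject if it *could* be a banned language."""
    return bool(set(candidates) & set(banned))

def strict_reject(candidates, banned):
    """Reject only if every candidate is banned, i.e. it cannot be
    anything else. An empty candidate set never rejects."""
    cands = set(candidates)
    return bool(cands) and cands <= set(banned)

# Invented result for an English message that also matched two extras.
candidates = ["en", "nl", "af"]

print(could_be_any(candidates, {"en"}))   # extras do no harm here
print(naive_reject(candidates, {"nl"}))   # extras cause a false reject
print(strict_reject(candidates, {"nl"}))  # safe, but rarely fires
```

With an over-inclusive candidate set the allow-list test still behaves, the naive reject test hits English mail, and the strict test almost never triggers. That is why a usable reject list needs a detector that commits to a single answer.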
