On Sun, Dec 06, 2009 at 11:49:25PM -0500, Matt Kettler wrote:
> Marc Perkel wrote:
> > I'm wondering if the language detection in TextCat can be improved.
> > Here's the situation.
> >
> > It appears that TextCat was designed to be inclusive. You list the
> > languages you want and it returns many possibilities so as not to
> > trigger unwanted falsely.
> >
> > What I'm doing is extracting the language list for Exim where I hope
> > to offer a language reject list. The problem is that when you are
> > rejecting languages you want a smaller list than when you are
> > including languages to avoid false positives. I'd rather have a single
> > (non-english) result.
> >
> > I'm wondering if there's a way to add some more options to alter the
> > behavior of the plugin so it is more optimized towards the idea of
> > rejecting languages?
> >
> >
>
> The language detection would have to be radically redesigned to have
> enough accuracy to support this.
>
> Currently TextCat is a *very* crude match, and will often return
> multiple languages for plain English text.
>
> Textcat is not designed to decide what language the email is, but to
> find a set of languages it *might* be. It is very prone to declaring
> extra languages that are not really present due to its design.
>
> This is useful in the "if it can't be my language, then it's garbage"
> sense, but not so useful in a "reject if it could be this language I
> don't like". You'd really want "reject if it *IS* this language I don't
> like", but textcat doesn't tell you what language an email is, only a
> set of what it might be.

Also beware of the case bug:

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229

I've got OK results on my corpus with textcat_acceptable_score around
1.02 and textcat_max_languages around 1-2. Of course, I wouldn't
outright reject anything.
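
For reference, those two settings would look roughly like this in
local.cf. Treat it as a sketch only: the values are the ones mentioned
above, 2 is an arbitrary pick from the 1-2 range, the loadplugin line
assumes the TextCat plugin is not already enabled elsewhere, and the
comments reflect my reading of the plugin documentation rather than
anything verified against the code.

```
# Assumption: TextCat is not yet enabled in one of the .pre files
loadplugin Mail::SpamAssassin::Plugin::TextCat

# Only keep languages whose score is close to the best match
textcat_acceptable_score 1.02

# If more than this many languages qualify, the result is "unknown"
textcat_max_languages 2
```

Tighter values give a shorter candidate list, but they do not change the
fact that textcat reports what a message might be, not what it is.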
