Re: Language detection in TextCat

Henrik K Sun, 06 Dec 2009 20:56:56 -0800

On Sun, Dec 06, 2009 at 11:49:25PM -0500, Matt Kettler wrote:
> Marc Perkel wrote:
> > I'm wondering if the language detection in TextCat can be improved.
> > Here's the situation.
> >
> > It appears that TextCat was designed to be inclusive. You list the
> > languages you want and it returns many possibilities so as not to
> > trigger unwanted falsely.
> >
> > What I'm doing is extracting the language list for Exim where I hope
> > to offer a language reject list. The problem is that when you are
> > rejecting languages you want a smaller list than when you are
> > including languages to avoid false positives. I'd rather have a single
> > (non-english) result.
> >
> > I'm wondering if there's a way to add some more options to alter the
> > behavior of the plugin so it is more optimized towards the idea of
> > rejecting languages?
> >
> >
> The language detection would have to be radically redesigned to have
> enough accuracy to support this.
> 
> Currently TextCat is a *very* crude match, and will often return
> multiple languages for plain English text.
> 
> TextCat is not designed to decide what language the email is, but to
> find a set of languages it *might* be. It is very prone to declaring
> extra languages that are not really present due to its design.
> 
> This is useful in the "if it can't be my language, then it's garbage"
> sense, but not so useful in a "reject if it could be this language I
> don't like".  You'd really want "reject if it *IS* this language I don't
> like", but textcat doesn't tell you what language an email is, only a
> set of what it might be.


Also beware of the case bug:

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229

I've got OK results on my corpus with textcat_acceptable_score around 1.02
and textcat_max_languages around 1-2. Of course, I wouldn't outright reject
anything.
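
As a rough sketch, settings in that range could be written in a SpamAssassin
configuration file such as local.cf as shown here. The ok_languages value and
the choice of 2 (rather than 1) for textcat_max_languages are assumptions made
for illustration only; the right values depend on your own corpus.

```
# Assumes the TextCat plugin is already loaded, e.g. via
#   loadplugin Mail::SpamAssassin::Plugin::TextCat
# in one of the .pre files.

# Languages considered acceptable (assumed example: English only).
ok_languages en

# Keep only candidate languages scoring close to the best match.
textcat_acceptable_score 1.02

# If more candidates than this remain, the language is treated as unknown.
textcat_max_languages 2
```

A tighter score and a lower language cap shorten the list of candidate
languages, at the cost of more messages ending up with no language at all.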
