Re: Naive bayes and character n-grams

Suneel Marthi Thu, 10 Oct 2013 06:20:36 -0700

Dean,

Just a thought.

You should be able to create new language models (with LangDetect) if there's
Wikipedia content for the specific language. I had to do this in the past for
Pashto and Malaysian.

On Thursday, October 10, 2013 8:16 AM, Dean Jones <[email protected]> 
wrote:

On 10 October 2013 12:46, Ted Dunning <[email protected]> wrote:
> For language detection, you are going to have a hard time doing better than
> one of the standard packages for the purpose.  See here:
>
> http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>

Thanks for the pointer, Ted. I'm a big fan of the Tika project; we use
it for content extraction already. For various reasons, though, we have
rolled our own language detector (mainly because neither of these
packages covers all of the languages we need to identify:
language-detection doesn't do Catalan, and Tika doesn't do Welsh).

Dean.
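
For context on the thread subject, here is a minimal sketch of how a
home-grown detector based on Naive Bayes over character n-grams might look.
It is not Dean's implementation, nor LangDetect's or Tika's. The names
char_ngrams and NgramNaiveBayes, the trigram size and the add-one smoothing
are all assumptions made for illustration, and real training text would come
from a much larger corpus (Wikipedia dumps, for example).

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text (hypothetical helper).

    The text is lowercased and padded with one space on each side so that
    word-initial and word-final n-grams are captured too.
    """
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                          # n-gram length (assumed: trigrams)
        self.alpha = alpha                  # additive smoothing constant
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> total n-grams seen
        self.docs = Counter()               # language -> training documents
        self.vocab = set()                  # all n-grams seen in training

    def train(self, lang, text):
        """Add one training document for the given language code."""
        grams = char_ngrams(text, self.n)
        self.counts[lang].update(grams)
        self.totals[lang] += len(grams)
        self.docs[lang] += 1
        self.vocab.update(grams)

    def detect(self, text):
        """Return the language with the highest posterior log-probability."""
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab)
        n_docs = sum(self.docs.values())
        best_lang, best_score = None, -math.inf
        for lang in self.counts:
            # Log prior from the share of training documents per language.
            score = math.log(self.docs[lang] / n_docs)
            denom = self.totals[lang] + self.alpha * vocab_size
            for gram in grams:
                # Smoothed log likelihood of each n-gram under this language.
                score += math.log((self.counts[lang][gram] + self.alpha) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang
```

To use it, call train() once per labelled document (for example with
language codes such as "ca" and "cy") and then detect() on the text to be
identified. Working in log space avoids underflow when multiplying many
small probabilities, and the smoothing keeps an unseen n-gram from driving
a language's score to zero.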
