On 10 October 2013 12:46, Ted Dunning <[email protected]> wrote:
> For language detection, you are going to have a hard time doing better than
> one of the standard packages for the purpose. See here:
>
> http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>

Thanks for the pointer, Ted. I'm a big fan of the Tika project; we already use it for content extraction. For various reasons, though, we have rolled our own language detector, mainly because neither of these packages covers all of the languages we need to identify: language-detection doesn't do Catalan, and Tika doesn't do Welsh.

Dean.
