For language detection, you are going to have a hard time doing better than one of the standard packages for the purpose. See here:
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

On Thu, Oct 10, 2013 at 1:01 AM, Dean Jones <[email protected]> wrote:
> Hi Si,
>
> On 10 October 2013 07:59, <[email protected]> wrote:
> >
> > What do you mean by character n-grams? If you mean things like "&ab" or
> > "ui2", then given that there are so few characters compared to words, is
> > there a problem that can't be solved without a look-up table for n < y
> > (where y < 4ish)?
> >
> > Or are you looking at y > 4ish? Because if so, do you run into the
> > issue of a sudden space explosion?
> >
>
> Yes, just tokens in a text broken up into sequences of their constituent
> characters. In my initial tests, language detection works well where n=3,
> particularly when including the head and tail bigrams. So I need something
> to generate the required sequence files from my training data.
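For anyone following along, here is a minimal sketch of the kind of character n-gram extraction Dean describes: each token is broken into overlapping character trigrams, plus the head and tail bigrams. The function name and the exact interpretation of "head and tail bigrams" (the leading and trailing two characters of each token) are my assumptions, not something specified in the thread.

```python
def char_ngrams(token, n=3):
    """Split a token into overlapping character n-grams.

    Also emits the head and tail bigrams, which (per my reading of the
    thread) help mark word boundaries for language detection.
    """
    # Overlapping n-character windows over the token.
    grams = [token[i:i + n] for i in range(len(token) - n + 1)]
    if len(token) >= 2:
        grams.append(token[:2])   # head bigram
        grams.append(token[-2:])  # tail bigram
    return grams

print(char_ngrams("hello"))  # ['hel', 'ell', 'llo', 'he', 'lo']
```

Feeding the grams for every token in a labelled corpus into a frequency model (or writing them out as sequence files for a trainer) is then a separate step.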
