Hey Dean, what do you mean by character n-grams? If you mean things like "&ab" or "ui2", then given that there are so few characters compared to words, is there a problem that can't be solved with a look-up table for n < y (where y is about 4)?
Or are you looking at y > 4 or so? If so, do you run into a sudden space explosion?

Best,
Simon

----
Dr. Simon Thompson

________________________________________
From: Dean Jones [[email protected]]
Sent: 09 October 2013 11:18
To: [email protected]
Subject: Naive bayes and character n-grams

Hello folks,

I see that it's possible to use Mahout to train a naive Bayes classifier using n-grams as features (or I guess, strictly speaking, Mahout can be used to generate sequence files containing n-grams; I suspect the naive Bayes trainer is indifferent to the form of features it trains on). Is there any facility to generate character n-grams instead of word n-grams?

Thanks,
Dean.
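
As a rough illustration of the two points raised in this thread (what a character n-gram is, and how quickly the feature space can grow with n), here is a minimal Python sketch. It is not Mahout code and does not use any Mahout API; the function names `char_ngrams` and `max_distinct_ngrams` are hypothetical, chosen only for this example.

```python
def char_ngrams(text, n):
    """Return every contiguous substring of length n, in order of appearance.

    Hypothetical helper for illustration only; not part of Mahout.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def max_distinct_ngrams(alphabet_size, n):
    """Upper bound on the number of distinct character n-grams.

    Each of the n positions can hold any of alphabet_size characters,
    so the bound is alphabet_size ** n.
    """
    return alphabet_size ** n


if __name__ == "__main__":
    # Character trigrams of a short string, spaces included.
    print(char_ngrams("naive bayes", 3))

    # Worst-case table size for a 26-letter alphabet as n grows.
    for n in range(1, 7):
        print(n, max_distinct_ngrams(26, n))
```

With a 26-letter alphabet the upper bound is 17,576 entries for n = 3 but 11,881,376 for n = 5, which is one way to read the "space explosion" concern above. Real text typically uses only a fraction of the possible n-grams, so these figures are worst-case bounds rather than expected sizes.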
