Redacted to pass the overly aggressive spam filter.

On Mon, Jun 27, 2011 at 7:19 PM, Hector Yee <[email protected]> wrote:
> Just make the pattern a feature and feed it into the machine learning.
>
> e.g. if it's a spam model and you notice v**gra is a spam term, just make
> feature 0 = "v**gra count" and the rest your regular bag of words.
>
> The only thing you have to be careful of is the relative weights between
> each feature category. A typical normalization is to L2-normalize each
> feature category separately before concatenation.
> Another option is to use a "scale free" classification algorithm like
> AdaBoost.
>
>
> On Mon, Jun 27, 2011 at 5:51 PM, Patrick Collins <[email protected]> wrote:
>
>> Has anyone got any advice on how to combine heuristics and classification?
>>
>> When preparing my data to build out the features to feed into my
>> classification model I keep noticing patterns of text which I know with
>> 99.99% probability imply a certain outcome.
>>
>> How would you construct the data/features in order to pre-classify this
>> data to provide much more likelihood that the classifier comes to the
>> "correct" conclusion?
>>
>> For example, I remember seeing an anti-spam machine which used a
>> combination of fuzzy logic and then classification to build a better outcome
>> (but he did not detail how it was actually implemented). He used a whole
>> range of heuristics to determine that a certain sender is known to be a
>> spammer rather than just blindly passing this data in to the classifier.
>>
>> In my dataset I have a LOT of patterns like this that I can identify and
>> then determine with very high probability the outcome. I say high
>> probability, but I cannot say absolutely. Ideally if I could precompute a
>> lot of this data using heuristics I could feed this information in to the
>> classifier to greatly reduce the number of features. But the classifiers do
>> not allow me the ability to provide a "weight" to a certain feature.
>>
>> Other than "well just try and see what works", I was wondering how do
>> people deal with this problem? Do they just leave it to the classifier and
>> hope that the classifier picks up the same patterns?
>>
>> I'm a bit new to Mahout and classification algorithms and so am just
>> trying to get some input from how others might see this problem and whether
>> I'm barking up the wrong tree.
>>
>> Patrick.
>>
>
>
>
> --
> Yee Yang Li Hector
> http://hectorgon.blogspot.com/ (tech + travel)
> http://hectorgon.com (book reviews)
>

--
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)
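
For illustration, the suggestion above (turn each heuristic into a count feature, then L2-normalize each feature category separately before concatenating) could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the names HEURISTIC_PATTERNS, l2_normalize and featurize, the example pattern, and the tiny vocabulary are all hypothetical and come from neither the thread nor Mahout.

```python
import math
import re
from collections import Counter

# Hypothetical heuristic patterns; each pattern contributes one count feature.
# The literal "v**gra" mirrors the redacted example term in the thread.
HEURISTIC_PATTERNS = [re.compile(r"v\*\*gra", re.IGNORECASE)]


def l2_normalize(values):
    """Scale a vector to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        return list(values)
    return [v / norm for v in values]


def featurize(text, vocabulary):
    """Build [heuristic counts] + [bag-of-words counts], normalizing each
    category separately so neither dominates purely because of its scale."""
    heuristic = [float(len(p.findall(text))) for p in HEURISTIC_PATTERNS]
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    bag = [float(counts[word]) for word in vocabulary]
    return l2_normalize(heuristic) + l2_normalize(bag)


# Usage with an assumed three-word vocabulary.
vocabulary = ["cheap", "pills", "meeting"]
vector = featurize("Cheap v**gra, cheap pills", vocabulary)
```

After normalization each category has unit length (or stays all zero when nothing matched), so the single heuristic feature is not swamped by a long bag-of-words block. Whether that balance is right for a given dataset is still something to check empirically.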
