On May 27, 2010, at 1:05 PM, Ted Dunning wrote:

> A bit off topic, but what you really want is collocations that bring
> different information to the party than the constituent words.

What I'm after right now are things of the nature of: Here's what you can do _right now_ (i.e. very little coding) by combining Mahout (and other open source tools, but preferably Mahout) with search at some point in the chain (either indexing or searching). Collocations seemed like a nice fit since I know phrases and collocations are often a pretty decent win in search without too much work. Obviously, many other things fit here, too.

I am also trying to answer the question of: Here's what you can do right now with Mahout as part of an "intelligent application", not necessarily search based (but one that still might use search under the hood). So, this leans a bit towards more BI-ish type things, like analytics, trend analysis, etc. Things like tracking topics, phrases, and keywords over time are often useful, as well as the more obvious stuff like clustering, classification, etc.

FWIW, the case for Mahout is already probably 5X what it was just 6 months ago. That is just beautiful. Suggestions welcome.

> That is, you
> need to detect cases where the "meaning" of the collocation is not
> compositionally predicted by the meanings of the words in the collocation.
> Simple collocation statistics really can't tell you that. Instead, you
> need to look at the contexts in which the words appear. Context statistics
> generally require a bit of smoothing, however, so you begin to step outside
> of where LLR type methods will really help you out. SVD and random indexing

Random indexing? Am I reading too much into that phrase beyond its obvious meaning? If so, reference please.

> are more likely to be what you need. The question becomes whether the
> semantic vector for the pair is significantly different from the semantic
> vector of either word or the average of the two. If so, the pair is
> valuable.
>
> This computation is WAY more intense than collocation counting,
> unfortunately, but LLR can be used to screen for the word pairs that are
> candidates for this.
> At that point, the workload is plausible since you can
> use something like an inverted index phrase search to get the statistics you
> need.

As always, Ted, it makes brilliant sense.

-Grant
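
P.S. For anyone following along, here is a rough Python sketch of the two-stage idea in the quoted text: screen word pairs with an LLR score, then check whether the pair's semantic vector differs from the vectors of its words and from their average. It is only an illustration under assumptions, not Mahout code. The function names, the 2x2 count layout (k11 = both words together, k12 and k21 = one word without the other, k22 = neither), and the 0.5 similarity threshold are all made up for the example, and the semantic vectors are assumed to come from somewhere else (SVD, random indexing, etc.).

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy term: N*log(N) - sum(k*log(k)).
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 table of co-occurrence counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_adds_information(pair_vec, vec1, vec2, max_similarity=0.5):
    """Hypothetical test: the pair is 'valuable' if its vector is not close
    to either word's vector or to their average. The threshold is an
    arbitrary placeholder, not a recommended value."""
    average = [(x + y) / 2.0 for x, y in zip(vec1, vec2)]
    closest = max(cosine(pair_vec, v) for v in (vec1, vec2, average))
    return closest < max_similarity
```

In use, you would compute `llr` for every candidate pair (cheap, from counts), keep only the high scorers, and run `pair_adds_information` on those, since building the context vectors is the expensive part.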
