A bit off topic, but what you really want is collocations that bring different information to the party than their constituent words do. That is, you need to detect cases where the "meaning" of the collocation is not compositionally predicted by the meanings of the words in the collocation. Simple collocation statistics really can't tell you that. Instead, you need to look at the contexts in which the words appear. Context statistics generally require a bit of smoothing, however, so you begin to step outside of where LLR-type methods will really help you out. SVD and random indexing are more likely to be what you need. The question becomes whether the semantic vector for the pair is significantly different from the semantic vector of either word, or from the average of the two. If so, the pair is valuable.
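To make that concrete, here is a minimal sketch of the comparison described above: score a pair by the cosine similarity between its context vector and the average of its constituents' vectors. The vectors here are toy values for illustration only; in practice they would come from SVD or random indexing over context co-occurrence counts, and the function names are mine, not from any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def compositionality(pair_vec, w1_vec, w2_vec):
    """Similarity between the pair's context vector and the average
    of its parts; a LOW value suggests the pair carries meaning its
    parts do not, i.e. it is a valuable collocation."""
    avg = [(a + b) / 2.0 for a, b in zip(w1_vec, w2_vec)]
    return cosine(pair_vec, avg)

# Toy example: "hot dog" appears in contexts unlike those of either
# "hot" or "dog", so the score comes out low.
hot = [0.9, 0.1, 0.0]
dog = [0.1, 0.8, 0.1]
hot_dog = [0.0, 0.1, 0.9]
print(compositionality(hot_dog, hot, dog))  # low -> non-compositional candidate
```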
This computation is WAY more intense than collocation counting, unfortunately, but LLR can be used to screen for the word pairs that are candidates for this. At that point, the workload is plausible, since you can use something like an inverted-index phrase search to get the statistics you need.

On Thu, May 27, 2010 at 9:54 AM, Drew Farris <[email protected]> wrote:

> On Thu, May 27, 2010 at 12:03 PM, Grant Ingersoll <[email protected]> wrote:
>
>> I just want to supplement my docs with some "high quality" collocations.
>> TF-IDF is good enough, just not clear how best to get them out at this
>> point, on a per-doc basis.
>
> You could use the CollocDriver to get a sense of the LLR range for your
> corpus and then provide a minLLR as an argument to seq2sparse -- that said,
> it doesn't necessarily address the issue of collocations that have a high
> LLR but are made up of words with a high frequency in the corpus. This
> might not be an issue for you, however.
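For reference, the LLR screening step mentioned above boils down to a G² statistic over a 2x2 contingency table of counts. Below is a small sketch using the entropy formulation (the same shape as Mahout's log-likelihood utility); the counts in the example are hypothetical, not from any real corpus: k11 is occurrences of the pair together, k12 and k21 are each word without the other, and k22 is everything else.

```python
import math

def xlogx(x):
    """x * ln(x), with the 0 * ln(0) = 0 convention."""
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    total = sum(counts)
    return xlogx(total) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.
    High values mean the pair co-occurs far more than chance."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# Strongly associated pair (observed together far above chance):
print(llr(100, 100, 100, 10000))   # large positive score
# Statistically independent pair (k11*k22 == k12*k21):
print(llr(4, 96, 96, 2304))        # essentially zero
```

Pairs scoring above a chosen minLLR cutoff would then go on to the more expensive context-vector comparison.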
