A bit off topic, but what you really want is collocations that bring
different information to the party than the constituent words.  That is, you
need to detect cases where the "meaning" of the collocation is not
compositionally predicted by the meanings of the words in the collocation.
 Simple collocation statistics really can't tell you that.  Instead, you
need to look at the contexts in which the words appear.  Context statistics
generally require a bit of smoothing, however, so you begin to step outside
of where LLR-type methods will really help you out.  SVD and random indexing
are more likely to be what you need.  The question becomes whether the
semantic vector for the pair is significantly different from the semantic
vector of either word or the average of the two.  If so, the pair is
valuable.
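As an illustrative sketch (not Mahout code; the tokenization, window size, and raw counts are all assumptions -- in practice you would smooth or reduce with SVD/random indexing first), comparing a pair's context vector against the combination of its constituents' context vectors might look like this:

```python
import math
from collections import Counter

def context_vector(token, corpus, window=2):
    """Sum co-occurrence counts of words within +/- `window` of `token`."""
    vec = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w != token:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[sent[j]] += 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def compositionality(pair_token, w1, w2, corpus):
    """Low similarity suggests the pair means something its parts don't."""
    pair_vec = context_vector(pair_token, corpus)
    avg_vec = context_vector(w1, corpus) + context_vector(w2, corpus)
    return cosine(pair_vec, avg_vec)
```

Here the corpus is assumed to be pre-tokenized with the candidate bigram already merged into a single token (e.g. `hot_dog`), so that the pair's contexts can be counted the same way as a single word's.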

This computation is WAY more intensive than collocation counting,
unfortunately, but LLR can be used to screen for the word pairs that are
candidates for it.  At that point the workload is plausible, since you can
use something like an inverted-index phrase search to get the statistics you
need.
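The LLR screen itself is cheap.  A minimal version of the usual 2x2
log-likelihood ratio test (essentially what Mahout's LogLikelihood class
computes) looks like this; the counts are occurrences of the pair versus the
marginal occurrences of each word:

```python
import math

def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """k11: A followed by B; k12: A without B; k21: B without A; k22: neither."""
    row_e = entropy(k11 + k12, k21 + k22)
    col_e = entropy(k11 + k21, k12 + k22)
    mat_e = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row_e + col_e - mat_e))
```

Independent word pairs score near zero; pairs that clear a threshold go on
to the expensive context-vector comparison.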

On Thu, May 27, 2010 at 9:54 AM, Drew Farris <[email protected]> wrote:

> On Thu, May 27, 2010 at 12:03 PM, Grant Ingersoll <[email protected]
> >wrote:
>
> >
> > I just want to supplement my docs with some "high quality" collocations.
> >  TF-IDF is good enough, just not clear how best to get them out at this
> > point, on a per doc basis.
> >
>
> You could use the CollocDriver to get a sense of the LLR range for your
> corpus and then provide a minLLR as an argument to seq2sparse -- that said,
> it doesn't necessarily address the issue of collocations that have a high
> LLR but are made up of words with a high frequency in the corpus. This
> might not be an issue for you, however.
>
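Concretely, the suggestion above would amount to something like the
following (a sketch only -- the minLLR cutoff of 50 and the paths are
arbitrary, and flag names should be checked against `mahout seq2sparse
--help` for your Mahout version):

```shell
# Vectorize with bigrams, keeping only collocations whose LLR >= 50
mahout seq2sparse \
  -i /path/to/seqfiles \
  -o /path/to/vectors \
  --maxNGramSize 2 \
  --minLLR 50 \
  --weight tfidf
```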
