For retrieval, I have had very good results from simply retaining high-LLR collocations and letting any subsequent processing deal with the weighting.
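To make "retaining high LLR collocations" concrete, here is a minimal sketch in Python. It scores adjacent word pairs with the log-likelihood ratio (G^2) on a 2x2 contingency table and keeps those above a cutoff. The function names, the threshold of 10.0, and the positive-association check are all assumptions made for illustration; this is not the Mahout implementation, and a real cutoff would need tuning on your corpus.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing (k * log k -> 0)
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def high_llr_bigrams(tokens, threshold=10.0):
    """Return {(a, b): score} for adjacent pairs scoring at least `threshold`.

    `threshold` is a hypothetical default, not a recommended value.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    first = Counter()   # how often a word appears as the left element
    second = Counter()  # how often a word appears as the right element
    for (a, b), count in bigrams.items():
        first[a] += count
        second[b] += count

    kept = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11           # a followed by something other than b
        k21 = second[b] - k11          # b preceded by something other than a
        k22 = n - k11 - k12 - k21      # neither a on the left nor b on the right
        # LLR is two-sided, so (as an assumption here) keep only pairs that
        # occur MORE often than independence would predict.
        if k11 * n <= first[a] * second[b]:
            continue
        score = llr(k11, k12, k21, k22)
        if score >= threshold:
            kept[(a, b)] = score
    return kept
```

The surviving pairs can then be added to a document as extra terms, unweighted, leaving TF-IDF or whatever comes next to assign the weights.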
On the other hand, I just saw this article which tested collocations for spam detection and got no lift because the individual constituent words were carrying the weight already.
(http://www.aueb.gr/users/ion/docs/TR2004_updated.pdf linked from http://aclweb.org/aclwiki/index.php?title=Spam_filtering_datasets)

On Thu, May 27, 2010 at 9:03 AM, Grant Ingersoll <[email protected]> wrote:

> > There may be use cases for keeping LLR if only for diagnostic purposes.
>
> I just want to supplement my docs with some "high quality" collocations.
> TF-IDF is good enough, just not clear how best to get them out at this
> point, on a per doc basis.
