For picking terms from a document that stand out against those in a large corpus, this tf*idf trick is nearly identical to using the log-likelihood ratio (LLR) test. It produces pretty darned good results.
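
For comparison, here is a minimal sketch of the likelihood score, assuming the test meant above is the log-likelihood ratio (G^2) test on a 2x2 table of counts. The function names and the layout of the four counts are illustrative choices, not something taken from this thread:

```python
import math


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """G^2 score for a 2x2 contingency table.

    Assumed layout (hypothetical, for illustration):
      k11 = occurrences of the term in the target document
      k12 = all other tokens in the target document
      k21 = occurrences of the term in the rest of the corpus
      k22 = all other tokens in the rest of the corpus
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))
```

A term that is as common in the document as in the corpus scores near zero, while a term that is much more frequent in the document than elsewhere scores high. That is the same ordering pressure tf*idf applies.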
On Tue, Jul 17, 2012 at 8:22 PM, Ken Krugler <[email protected]> wrote:

> The simplistic approach I used was to extract the top 50 terms (with
> TF*IDF weights) from the target document, then use those terms (with
> weights as boosts) to do a regular Lucene OR query & request the top 20
> hits.
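
The quoted approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the smoothed idf formula, the field name "text", and the helper names are all hypothetical, and the tokens are assumed to be plain alphanumeric words that need no query-syntax escaping. The output is a string in Lucene's classic query parser syntax (`term^boost`, with terms OR'ed by default); running it and asking for the top 20 hits is left to the search side.

```python
import math
from collections import Counter


def top_tfidf_terms(doc_tokens, doc_freq, num_docs, n=50):
    """Return the n highest-weighted (term, weight) pairs for one document.

    doc_tokens: list of tokens from the target document
    doc_freq:   dict mapping term -> number of corpus documents containing it
    num_docs:   total number of documents in the corpus
    """
    tf = Counter(doc_tokens)
    weights = {}
    for term, count in tf.items():
        df = doc_freq.get(term, 0)
        # Smoothed idf (an assumption; any reasonable idf variant would do).
        idf = math.log((num_docs + 1) / (df + 1)) + 1.0
        weights[term] = count * idf
    # Sort by descending weight, breaking ties alphabetically for stability.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


def boosted_or_query(weighted_terms, field="text"):
    # Space-separated clauses are OR'ed under the classic parser's default
    # operator; "^" attaches the TF*IDF weight as a boost.
    return " ".join(
        f"{field}:{term}^{weight:.2f}" for term, weight in weighted_terms
    )


if __name__ == "__main__":
    tokens = ["lucene", "lucene", "the", "query"]
    df = {"the": 1000, "lucene": 5, "query": 50}
    terms = top_tfidf_terms(tokens, df, num_docs=1000, n=2)
    print(boosted_or_query(terms))
```

Capping the term list (50 in the quoted message) keeps the query small, and the boosts let the most distinctive terms dominate the ranking.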
