On May 27, 2010, at 11:58 AM, Ted Dunning wrote:

> Just to forestall some effort on this, LLR is very good for threshold, but
> the value is bad as a score so substituting TF or TFIDF is entirely
> appropriate.

Good to know.

>
> There may be use cases for keeping LLR if only for diagnostic purposes.

I just want to supplement my docs with some "high quality" collocations. TF-IDF is good enough, just not clear how best to get them out at this point, on a per doc basis.

>
> On Thu, May 27, 2010 at 8:52 AM, Drew Farris <[email protected]> wrote:
>
>>> 2. How can I, given a vector, get the top collocations for that Vector, as
>>> ranked by LLR?
>>>
>>
>> If I recall correctly, the LLR score gets dropped in seq2sparse in favor of
>> TF or TFIDF depending on the nature of the vectors being generated.
>> Meanwhile, CollocDriver simply emits a list of collocations in a collection
>> ranked by llr, so neither is strictly what you are interested in. Is there a
>> good way to include both something like TF >and< LLR in the output of
>> seq2sparse -- would it be necessary to resort to emitting 2 separate sets of
>> vectors?
>>
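
To make the split discussed above concrete (LLR as a threshold, TF-IDF as the score), here is a rough sketch in plain Python rather than Mahout's Java. It is not what seq2sparse or CollocDriver does internally. The names llr, top_collocations and min_llr are made up for illustration, the per-doc weights are assumed to already be available as a dict of n-gram to TF-IDF, and the sample numbers are invented.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts.

    k11: A and B together, k12: A without B,
    k21: B without A,      k22: neither A nor B.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing (0 * log 0 taken as 0)
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def top_collocations(doc_weights, llr_scores, min_llr, n):
    """Use LLR only as a filter, then rank survivors by per-doc TF-IDF.

    doc_weights: hypothetical dict of n-gram -> TF-IDF weight for one doc.
    llr_scores:  hypothetical dict of n-gram -> corpus-level LLR.
    """
    kept = [(gram, weight) for gram, weight in doc_weights.items()
            if llr_scores.get(gram, 0.0) >= min_llr]
    # Highest TF-IDF first; ties broken alphabetically for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept[:n]


# Invented sample data for a single document.
doc = {"new york": 2.5, "of the": 3.0, "log likelihood": 1.8}
scores = {"new york": 120.0, "of the": 0.4, "log likelihood": 45.0}
print(top_collocations(doc, scores, min_llr=10.0, n=2))
```

With this arrangement "of the" is dropped by the LLR cutoff even though it has the highest TF-IDF weight in the doc, which is the behaviour wanted from a threshold; the LLR values themselves never influence the final ordering.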
