Re: Tags generation?

Pat Ferrel Fri, 03 Aug 2012 09:39:05 -0700

We do what Ted describes by tossing frequently used terms with the IDF max, 
tossing stop words and stemming with a lucene analyzer. The stemming makes the 
tags less readable for sure but without it the near duplicate terms make for a 
strange looking tag list. With or without stemming the top TFIDF terms work 
rather well for tags.
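
A minimal sketch of the "top TFIDF terms as tags" idea, in plain Python rather than the Lucene analyzer chain described above. The stop list, the toy documents, and the function names are all made up for illustration, and stemming is left out entirely.

```python
import math
from collections import Counter

# Hypothetical stop list and toy corpus, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in", "for"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words (no stemming here)."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def top_tfidf_tags(docs, doc_index, k=3):
    """Return the k highest-scoring TF-IDF terms of docs[doc_index]."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    tf = Counter(tokenized[doc_index])
    scores = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
    # Sort by descending score, then alphabetically so ties are stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]

docs = [
    "the solr search index and the solr query parser",
    "a clustering of documents in the search index",
    "the stemming of terms for the search index",
]
print(top_tfidf_tags(docs, 0, k=2))
```

Terms that occur in every document ("search", "index") score zero and drop to the bottom of the ranking, which is the effect one wants from a tag list.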

If you are using tags in a UI the question becomes, what do you do when a user 
selects a tag? The classical answer is search for that term but if you do that 
you throw away the vector signature and are doing a single word search. We are 
planning to reweight the term vector and use it to do a 
"MoreLikeThis" Solr search, if we ever get to it...
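
One way the reweighting could look, as a rough sketch only: boost the selected tag inside the document's term vector and turn the whole vector into a query string using the Lucene `field:term^boost` syntax. This does not call Solr's MoreLikeThis handler; the field name `text`, the boost factor, and the helper name are assumptions, and real terms would need query escaping.

```python
def boosted_query(term_weights, selected_tag, boost=2.0, field="text"):
    """Build a Lucene-syntax query string from a reweighted term vector.

    term_weights: dict of term -> weight (e.g. TF-IDF scores).
    selected_tag: the tag the user clicked; its weight is multiplied by boost.
    field: hypothetical index field name.
    """
    reweighted = dict(term_weights)
    # Emphasize the selected tag instead of searching for it alone.
    reweighted[selected_tag] = reweighted.get(selected_tag, 1.0) * boost
    ordered = sorted(reweighted.items(), key=lambda kv: (-kv[1], kv[0]))
    return " OR ".join(f"{field}:{term}^{weight:.2f}" for term, weight in ordered)

vector = {"solr": 2.0, "parser": 1.0, "query": 1.5}
print(boosted_query(vector, "query"))
```

The other terms of the vector stay in the query, so the search still reflects the whole signature rather than a single word.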

On Aug 3, 2012, at 12:43 AM, Ted Dunning <[email protected]> wrote:

tf-idf is a good approximation of the LLR score for many applications and
often gives useful signatures although not always super pretty.

It helps to have an overall minimum document frequency for terms to be
considered for being tags.  This is the same as an IDF maximum.
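
A small check of that equivalence, assuming the common definition idf = log(N / df); other IDF variants shift the numbers but keep the same ordering. The corpus size and threshold below are arbitrary illustration values.

```python
import math

def idf(n_docs, df):
    """Inverse document frequency, assuming idf = log(N / df)."""
    return math.log(n_docs / df)

def passes_min_df(df, min_df):
    """Keep a term if it appears in at least min_df documents."""
    return df >= min_df

def passes_max_idf(n_docs, df, min_df):
    """Keep a term if its IDF is at most the IDF of a term with df == min_df."""
    return idf(n_docs, df) <= idf(n_docs, min_df)

n_docs, min_df = 1000, 5  # arbitrary illustration values
for df in range(1, n_docs + 1):
    # Both filters accept and reject exactly the same terms.
    assert passes_min_df(df, min_df) == passes_max_idf(n_docs, df, min_df)
```

Because IDF falls as document frequency rises, a floor on df and a ceiling on IDF cut the vocabulary at the same place.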

On Fri, Aug 3, 2012 at 1:35 AM, Lance Norskog <[email protected]> wrote:

> I'm looking for a good tags generator. A function from document/term
> matrix onto term list is a good bet, since it creates an analysis of
> the interplay of document and term. I have an LSA implementation for
> grinding on document/term matrices. This is very effective but seems
> overkill. Is there a simpler function from a document/term matrix onto
> a terms list? Maybe the mean tf-idf or log-entropy?
> 
> --
> Lance Norskog
> [email protected]
>
